P.O. Box 1450
Alexandria, VA 22313-1450 

I, James A. Flight, hereby declare and state: 

1. I am the attorney of record for the above-referenced patent application.

2. The United States Patent & Trademark Office issued an Office 
action dated March 21, 2006, purporting to reject all pending claims of the 
above-referenced patent application based on Sherwood et al., Phase 
Tracking and Prediction, technical report CS2002-0710, UC San Diego, 
June 2002. That technical report can be found on the Internet at 
http://www.cse.ucsd.edu/Dienst/UI/2.0/Describe/ncstrl.ucsd_cse/CS2002-0710.
A copy of the technical report is attached hereto as Exhibit 1.

3. As shown in Exhibit 1, the technical report includes two links.
Both of those links provide the ability to download a 10 KB PostScript file.
A copy of that PostScript file is attached hereto as Exhibit 2.

4. Exhibit 2 states "To obtain a copy of this techreport, please look 
for it at the following site: 

http://www-cse.ucsd.edu/users/calder/papers.html
Or send email or letter to: 

Brad Calder... calder@cs.ucsd.edu." Following the link in Exhibit 2 leads
to Exhibit 3.

5. Exhibit 3 does not list the technical report as published in 2002.
However, it identifies a corresponding paper as published in June of 2003,
namely, Timothy Sherwood, Suleyman Sair, and Brad Calder, Phase
Tracking and Prediction, 30th International Symposium on Computer
Architecture, June 2003. The PDF of this June of 2003 paper is linked in
Exhibit 3. A copy of this June of 2003 paper is attached hereto as Exhibit
4. Exhibit 4 identifies its date of publication as June of 2003.

6. The Office action of March 21, 2006 purports to reject the claims 
noted in paragraph 2 above based on the technical report of 2002. 
However, it actually relies on the content of the June of 2003 publication 
(i.e., Exhibit 4) to support the rejections. As noted above, the technical 
report of 2002 (Exhibit 1) includes only an abstract. It does not include 
Exhibit 4. 

7. Because of the evident inconsistency between Exhibit 1 and
Exhibit 4, I sent an email to Timothy Sherwood, the first listed author in
Exhibits 1 and 4, on October 30, 2006, asking Mr. Sherwood to identify the
correct date of publication for the full article. As shown in Exhibit 5, Mr.
Sherwood replied to my email on October 30, 2006 by stating:

Hi Jim, 

The publication appeared in ISCA 2003 and so was officially 
published on June 9th, 2003. 

http://cs.nyu.edu/isca03/ 

-Tim 

8. The citation in Mr. Sherwood's email (Exhibit 5) is to the program
announcement shown in Exhibit 6. That program announcement states
"The thirtieth International Symposium on Computer Architecture (ISCA)
will be held at the Town and Country Hotel in San Diego 9-11 June,
2003." Thus, Mr. Sherwood, the first named author of Exhibits 1 and 4,
indicated that the full publication was made available in June of 2003,
which is after the April 28, 2003 filing date of the instant application.

9. Having no reason whatsoever to doubt Mr. Sherwood's testimony, 
I prepared and filed a response to the Office action dated March 21, 2006 
attaching a copy of Mr. Sherwood's email and explaining that, based on 
the author's testimony, the full Sherwood article (i.e., Exhibit 4) is not 
prior art to the instant application. 

10. The USPTO issued a final Office action on May 16, 2007 refusing 
to accept the author's testimony because it was not in the form of an 
affidavit or declaration. Accordingly, I am submitting this requested 
declaration to verify the source of the emails noted above. 

U.S. Serial No. ^424,356
Rule 131 Declaration

11. The final Office action attempts to locate evidence that the author,
Mr. Sherwood, is incorrect in his belief that the publication of Exhibit 4
did not occur until after the filing date of the instant application. In
particular, the final Office action cites four pieces of evidence to allegedly
support an earlier publication date for the full article.

12. First, the final Office action cites the publication "ACM SIGARCH 
Computer Architecture News, archive volume 31, Issue 2 (May 2003), 
pages 336-349, ISSN: 0163-5964" (attached hereto as Exhibit 7) as
evidencing publication prior to June of 2003. Exhibit 7 does indeed note 
that the Sherwood article was published in May of 2003. However, May 
of 2003 is still after the April 28, 2003 filing date of this application. 
Therefore, Exhibit 7 fails to make the full Sherwood article (Exhibit 4) 
prior art to this application. A copy of the Sherwood article as linked to in 
Exhibit 7 is attached hereto as Exhibit 8. 

13. The final Office action also cites to Exhibits 9 and 10 (reports by 
year and author, respectively) which identify the date of the technical 
report (i.e., Exhibit 1) as June 23, 2002. However, as noted above, the
technical report is not the full article (i.e., it is not Exhibits 4 or 8) and, 
thus, Exhibits 9 and 10 only serve to document Exhibit 1 as prior art, not 
the full article (i.e., not Exhibits 4 or 8). 

14. Finally, the Office action cites to footnote 20 of "Automatically
Characterizing Large Scale Program Behavior," 2002, ISBN 1-58113-574-2
(Exhibit 11) as evidence of the availability of the technical report (i.e.,
Exhibit 1). Again, however, Exhibit 1 is not Exhibits 4 or 8 and, thus, the
demonstration of the 2002 publication of the technical report does not
make Exhibits 4 or 8 prior art. Thus, the evidence of record relied upon by
the final Office action does not demonstrate Exhibits 4 and/or 8 to be prior
art to the instant application. Accordingly, the final Office action's
reliance on the content of Exhibits 4 and/or 8 to reject the claims is
improper.

15. To attempt to determine the earliest publication date of the full 
article (i.e., Exhibits 4 and/or 8), the undersigned searched the Internet 
Archive Wayback Machine (http://www.archive.org/index.php) for the 
technical report (i.e., Exhibit 1). As shown in Exhibit 12, that search
demonstrated that only an abstract of the Sherwood article (i.e., only
Exhibit 1) was present on the Internet as of November 19, 2002. In
particular, as shown in Exhibit 13 attached hereto (i.e., the printout from
the November 19, 2002 link of Exhibit 12), only a single paragraph and
not the full Sherwood article (i.e., not Exhibits 4 and/or 8) could be
accessed as of November 19, 2002. Exhibit 14, which is the printout
from the July 9, 2003 link of Exhibit 12, is identical to the November 19, 
2002 information. Therefore, only Exhibit 1 has been shown to be prior 
art. There is no evidence of record that the full Sherwood article (i.e., 
Exhibits 4 and/or 8) is prior art to this application. 

16. As noted above, Exhibit 2 invited the public to contact Brad Calder 
"to obtain a copy of the techreport." Therefore, there is a possibility that 
Mr. Calder was providing something beyond the content of Exhibit 1 to 
requestors prior to the filing date of this application. Accordingly, to 
determine if anything more than Exhibit 1 is prior art to the instant 
application, I sent the email attached hereto as Exhibit 15 to Mr. Calder on 
July 7, 2007 asking Mr. Calder to clarify the situation. Having had no 
response, I again sent the email attached hereto as Exhibit 16 to Mr.
Calder. As shown at Exhibit 17, Exhibit 16 was delivered to Mr. Calder.
To date, he has made no response. 



17. I also sent a request for clarification to the webmaster for the server
that serves Exhibit 1 to the Internet, asking for clarification as to the
situation and to obtain the PostScript file attached hereto as Exhibit 2. As
shown in Exhibit 18, the webmaster responded by re-booting the server to
make Exhibit 2 available, and by referring to Exhibits 9 and 10. As
discussed above, none of this shows the full article (i.e., Exhibits 4 and/or
8) to be prior art. Therefore, despite all efforts by the undersigned and the
Examiner to date, nothing of record indicates that anything beyond Exhibit
1 is prior art to the instant application.

18. I understand that willful and false statements and the like are
punishable by a fine and/or imprisonment under 18 U.S.C. § 1001, and
that such willful false statement may jeopardize the validity of this
application and any patent resulting therefrom.



Date: July 16, 2007

Phase Tracking and Prediction

Timothy Sherwood, Suleyman Sair and Brad Calder 

CS2002-0710 

June 23, 2002 

In a single second a modern processor can execute billions of instructions. Obtaining a bird's eye view 
of the behavior of a program at these speeds can be a difficult task when all that is available is cycle by 
cycle examination. In many programs, behavior is anything but steady state, and understanding the 
patterns of behavior at run-time can unlock a multitude of optimization opportunities. In this paper we 
present a unified profiling architecture that can efficiently capture, classify, and predict program 
behavior on the largest of time scales all at run-time with no support from software. By examining the 
proportion of instructions that were executed from different sections of code, we can find generic phases 
that correspond to changes in behavior across many metrics. By classifying phases generically, we avoid 
the need to identify phases for each optimization, and enable a unified prediction scheme that can 
forecast future behavior. We examine the ability of our phase tracking architecture to accurately capture 
the phase behavior of a program's execution with respect to its overall performance (IPC), branch 
prediction, cache performance, and energy, and show how phase behavior may be captured efficiently 
using a simple predictor. 



How to view this document 

• Display the whole document in one of the following formats. 

o PostScript 10012 bytes.

• Print or download all or selected pages. 



The authors of these documents have submitted their reports to this technical report series 
for the purpose of non-commercial dissemination of scientific work. The reports are 
copyrighted by the authors, and their existence in electronic format does not imply that the 
authors have relinquished any rights. You may copy a report for scholarly, non-commercial 
purposes, such as research or instruction, provided that you agree to respect the author's 
copyright. For information concerning the use of this document for other than research or 
instructional purposes, contact the authors. Other information concerning this technical 
report series can be obtained from the Computer Science and Engineering Department at the 
University of California at San Diego, techreports@cs.ucsd.edu.

Techreport



To obtain a copy of this techreport, please 
look for it at the following site: 

http://www-cse.ucsd.edu/users/calder/papers.html

Or send email or a letter to: 

Brad Calder

University of California, San Diego 

9500 Gilman Drive 

La Jolla, CA 92093-0114 

calder@cs.ucsd.edu

Publications



The documents contained in these directories have been provided by the contributing authors as a means 
to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright 
and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that 
they have offered their works here electronically. It is understood that all persons copying this 
information will adhere to the terms and constraints invoked by each author's copyright. These works 
may not be reposted without the explicit permission of the copyright holder. 



• Thesis 



Publications are listed below by year of publication. For a list of papers by category, please click here.

Book Chapters

• Brad Calder, Timothy Sherwood, Greg Hamerly, and Erez Perelman, SimPoint: Picking
Representative Samples to Guide Simulation. Chapter 7 in the book "Performance Evaluation 
and Benchmarking" edited by Lizy Kurian John and Lieven Eeckhout; published by CRC Press, 
2005 

2007 

• Satish Narayanasamy, Zhenghao Wang, Jordan Tigani, Andrew Edwards and Brad Calder.
Automatically Classifying Benign and Harmful Data Races Using Replay Analysis, ACM
SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2007
(pdf)

• Erez Perelman, Jeremy Lau, Harish Patil, Aamer Jaleel, Greg Hamerly, and Brad Calder. Cross
Binary Simulation Points. International Symposium on Performance Analysis of Systems and
Software (ISPASS), April 2007 (pdf)

• Satish Narayanasamy, Ayse Coskun, and Brad Calder. Transient Fault Prediction Based on
Anomalies in Processor Events, Design Automation and Test in Europe (DATE), April 2007
(pdf)

• Weifeng Zhang, Brad Calder and Dean Tullsen. Accelerating and Adapting Precomputation
Threads for Efficient Prefetching. In the International Symposium on High Performance
Computer Architecture, February 2007. (pdf)

• Wei Chuang, Satish Narayanasamy, and Brad Calder. Bounds Checking with Taint-Based
Analysis. International Conference on High Performance Embedded Architectures & Compilers,
January 2007 (pdf)

2006 



http://www-cse.ucsd.edu/users/calder/papers.html 



7/16/2007 



• Jack Sampson, Ruben Gonzalez, Jean-Francois Collard, Norm Jouppi, Mike Schlansker and Brad
Calder, Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast
Barriers. In proceedings of the International Symposium on Microarchitecture (MICRO),
December 2006. (pdf)

• Wei Chuang, Satish Narayanasamy, Ganesh Venkatesh, Jack Sampson, Michael Van Biesbrouck,
Gilles Pokam, Osvaldo Colavin, and Brad Calder. Unbounded Page-Based Transactional
Memory. International Conference on Architectural Support for Programming Languages and
Operating Systems (ASPLOS), Oct 2006 (pdf)

• Satish Narayanasamy, Cristiano Pereira and Brad Calder. Recording Shared Memory
Dependencies Using Strata, International Conference on Architectural Support for
Programming Languages and Operating Systems (ASPLOS), October 2006 (pdf)

• Satish Narayanasamy, Bruce Carneal and Brad Calder. Patching Processor Design Errors.
International Conference on Computer Design, October 2006 (pdf)

• Weifeng Zhang, Steve Checkoway, Brad Calder, and Dean Tullsen. Speculative Code Value
Specialization Using the Trace Cache Fill Unit. International Conference on Computer Design,
Oct 2006 (pdf)

• Satish Narayanasamy, Cristiano Pereira and Brad Calder. Software Profiling for Deterministic
Replay Debugging of User Code, The 5th International Conference on Software Methodologies,
Tools and Techniques (SoMeT), October 2006 (pdf)

• Michael Van Biesbrouck, Lieven Eeckhout, and Brad Calder, Efficient Sampling Startup for
SimPoint. IEEE Micro Special Issue Jul/Aug 06 Modeling & Simulation (pdf)

• Jeremy Lau, Matt Arnold, Michael Hind, and Brad Calder, Online Performance Auditing:
Using Hot Optimizations Without Getting Burned. ACM SIGPLAN Conference on
Programming Language Design and Implementation (PLDI), June 2006 (pdf)

• Satish Narayanasamy, Cristiano Pereira, Harish Patil, Robert Cohn, and Brad Calder. Automatic
Logging of Operating System Effects to Guide Application Level Architecture Simulation.
ACM Sigmetrics International Conference on Measurement and Modeling of Computer Systems
(Sigmetrics), June 2006 (pdf)

• Greg Hamerly, Erez Perelman, Jeremy Lau, Brad Calder, and Timothy Sherwood. Using
Machine Learning to Guide Architecture Simulation. Journal of Machine Learning Research
(JMLR), 2006 (pdf)

• Erez Perelman, Marzia Polito, Jean-Yves Bouguet, John Sampson, Brad Calder, Carole Dulong,
Detecting Phases in Parallel Applications on Shared Memory Architectures. IEEE
International Parallel and Distributed Processing Symposium (IPDPS), April 2006. (pdf)

• Greg Hamerly, Erez Perelman, and Brad Calder, Comparing Multinomial and K-Means
Clustering for SimPoint. International Symposium on Performance Analysis of Systems and
Software (ISPASS), March 2006. (pdf)

• Michael Van Biesbrouck, Lieven Eeckhout, and Brad Calder, Considering All Starting Points
for Simultaneous Multithreading Simulation, International Symposium on Performance
Analysis of Systems and Software (ISPASS), March 2006. (pdf)

• Jeremy Lau, Erez Perelman, and Brad Calder, Selecting Software Phase Markers with Code
Structure Analysis. International Symposium on Code Generation and Optimization (CGO),
March 2006. (pdf)

• Weifeng Zhang, Brad Calder and Dean Tullsen, A Self Repairing Prefetcher in an
Event-Driven Dynamic Optimization Framework, International Symposium on Code Generation and
Optimization (CGO), March 2006. (pdf)

2005 

• Satish Narayanasamy, Gilles Pokam, and Brad Calder, BugNet: Recording Application Level
Execution for Deterministic Replay Debugging, IEEE Micro: Micro's Top Picks from
Computer Architecture Conferences, December 2005 (pdf)

• Michael Van Biesbrouck, Lieven Eeckhout, and Brad Calder, Efficient Sampling Startup for
Sampled Processor Simulation. 2005 International Conference on High Performance Embedded
Architectures & Compilers, November 2005. (pdf)

• Bengu Li, Ganesh Venkatesh, Brad Calder, and Rajiv Gupta, Exploiting a Computation Reuse
Cache to Reduce Energy in Network Processors. 2005 International Conference on High
Performance Embedded Architectures & Compilers, November 2005. (pdf)

• Lieven Eeckhout, John Sampson, and Brad Calder, Exploiting Program Microarchitecture
Independent Characteristics and Phase Behavior for Reduced Benchmark Suite Simulation.
2005 IEEE International Symposium on Workload Characterization, October 2005 (pdf)

• Cristiano Pereira, Jeremy Lau, Brad Calder, and Rajesh Gupta, Dynamic Phase Analysis for
Cycle-Close Trace Generation. International Conference on Hardware/Software Codesign and
System Synthesis, September 2005 (pdf)

• Weifeng Zhang, Brad Calder, and Dean Tullsen, An Event-Driven Multithreaded Dynamic
Optimization Framework, International Conference on Parallel Architectures and Compilation
Techniques, September 2005. (pdf)

• Erez Perelman, Trishul Chilimbi, and Brad Calder, Variational Path Profiling, International
Conference on Parallel Architectures and Compilation Techniques, September 2005. (pdf)

• Greg Hamerly, Erez Perelman, Jeremy Lau, and Brad Calder, SimPoint 3.0: Faster and More
Flexible Program Analysis. Journal of Instruction Level Parallelism, September 2005. (pdf)

• Brad Calder, Andrew Chien, Ju Wang and Don Yang, The Entropia Virtual Machine for
Desktop Grids. International Conference on Virtual Execution Environments, June 2005. (pdf)

• Satish Narayanasamy, Gilles Pokam, and Brad Calder, BugNet: Continuously Recording
Program Execution for Deterministic Replay Debugging. International Symposium on
Computer Architecture, June 2005. (pdf)

• Satish Narayanasamy, Hong Wang, Perry Wang, John Shen, Brad Calder, A Dependency Chain
Clustered Microarchitecture, IEEE International Parallel and Distributed Processing
Symposium, April 2005. (pdf)

• Jeremy Lau, Jack Sampson, Erez Perelman, Greg Hamerly, and Brad Calder, The Strong
Correlation between Code Signatures and Performance. IEEE International Symposium on
Performance Analysis of Systems and Software, March 2005. (pdf)

• Jeremy Lau, Erez Perelman, Greg Hamerly, Timothy Sherwood, and Brad Calder, Motivation for
Variable Length Intervals and Hierarchical Phase Behavior. IEEE International Symposium
on Performance Analysis of Systems and Software, March 2005. (pdf)

• Jeremy Lau, Stefan Schoenmackers, and Brad Calder, Transition Phase Classification and
Prediction. In the 11th International Symposium on High Performance Computer Architecture,
February 2005. (pdf)

2004 

• Nathan Tuck, Brad Calder and George Varghese. Hardware and Binary Modification Support
for Code Pointer Protection From Buffer Overflow. 37th International Symposium on
Microarchitecture, December 2004. (pdf)

• Eric Tune, Rakesh Kumar, Dean Tullsen and Brad Calder. Balanced Multithreading:
Increasing Throughput via a Low Cost Multithreading Hierarchy, 37th International
Symposium on Microarchitecture, December 2004. (pdf)

• Timothy Sherwood, Mark Oskin, and Brad Calder. Balancing Design Options with Sherpa.
International Conference on Compilers, Architecture, and Synthesis for Embedded Systems
(CASES), September 2004, (pdf)

• Brad Calder, Todd Austin, Don Yang, Timothy Sherwood, Suleyman Sair, David Newquist and
Tim Cusac, BitRaker Anvil: Binary Instrumentation for Rapid Creation of Simulation and
Workload Analysis Tools, Proceedings of Global Signal Processing (GSPx), September, 2004,
(pdf)

• Greg Hamerly, Erez Perelman, and Brad Calder, How to Use SimPoint to Pick Simulation
Points, ACM SIGMETRICS Performance Evaluation Review, Volume 31(4), March 2004 (pdf)

• Glenn Reinman and Brad Calder, Using a Serial Cache for Energy Efficient Instruction
Fetching, Journal of Systems Architecture, 2004, (pdf)

• Jeremy Lau, Stefan Schoenmackers, and Brad Calder, Structures for Phase Classification, 2004
IEEE International Symposium on Performance Analysis of Systems and Software, March 2004
(pdf).

• Michael Van Biesbrouck, Timothy Sherwood, and Brad Calder, A Co-Phase Matrix to Guide
Simultaneous Multithreading Simulation, 2004 IEEE International Symposium on Performance
Analysis of Systems and Software, March 2004 (pdf)

• Nathan Tuck, Timothy Sherwood, Brad Calder, and George Varghese, Deterministic
Memory-Efficient String Matching Algorithms for Intrusion Detection. Proceedings of the IEEE
Infocom Conference, Hong Kong, China, March 2004. (pdf)

• Satish Narayanasamy, Yuanfang Hu, Suleyman Sair, and Brad Calder, Creating Converged
Trace Schedules Using String Matching. In the 10th International Symposium on High
Performance Computer Architecture, February 2004. (pdf)

2003 

• Timothy Sherwood, Erez Perelman, Greg Hamerly, Suleyman Sair, and Brad Calder, Discovering
and Exploiting Program Phases, IEEE Micro: Micro's Top Picks from Computer Architecture
Conferences, December 2003 (pdf)

• Jeremy Lau, Stefan Schoenmackers, Timothy Sherwood, and Brad Calder, Reducing Code Size
With Echo Instructions. International Conference on Compilers, Architecture, and Synthesis for
Embedded Systems, October 2003. (pdf)

• Erez Perelman, Greg Hamerly, and Brad Calder, Picking Statistically Valid and Early
Simulation Points. International Conference on Parallel Architectures and Compilation
Techniques, September 2003. (pdf)

• Erez Perelman, Greg Hamerly, Michael Van Biesbrouck, Timothy Sherwood, and Brad Calder,
Using SimPoint for Accurate and Efficient Simulation. ACM SIGMETRICS the International
Conference on Measurement and Modeling of Computer Systems, June 2003. (pdf)

• Timothy Sherwood, George Varghese, and Brad Calder, A Pipelined Memory Architecture for
High Throughput Network Processors, 30th International Symposium on Computer
Architecture, June 2003. (pdf)

• Timothy Sherwood, Suleyman Sair, and Brad Calder, Phase Tracking and Prediction, 30th
International Symposium on Computer Architecture, June 2003. (pdf)

• Weihaw Chuang and Brad Calder, Predicate Prediction for Efficient Out-of-order Execution,
16th Annual ACM International Conference on Supercomputing, June 2003. (pdf)

• Weihaw Chuang, Brad Calder, and Jeanne Ferrante, Phi Predication for Light Weight
If-Conversion, International Symposium on Code Generation and Optimization, March 2003. (pdf)

• Suleyman Sair, Timothy Sherwood and Brad Calder, A Decoupled Predictor-Directed Stream
Prefetching Architecture. In the IEEE Transactions on Computers, Vol. 52, No. 5, March 2003

• Andrew Chien, Brad Calder, Stephen Elbert, and Karan Bhatia. Entropia: Architecture and
Performance of an Enterprise Desktop Grid System. Journal of Parallel Distributed
Computing, Vol 63, Issue 5, May 2003, pages 597-610. (pdf)

• Satish Narayanasamy, Timothy Sherwood, Suleyman Sair, Brad Calder, George Varghese,
Catching Accurate Profiles in Hardware. In the 9th International Symposium on High
Performance Computer Architecture, February 2003. (pdf)

• Beth Simon, Brad Calder, and Jeanne Ferrante, Incorporating Predicate Information Into
Branch Predictors. 9th International Symposium on High Performance Computer Architecture,
February 2003. (pdf)

2002 

• Jamison Collins, Suleyman Sair, Brad Calder, and Dean Tullsen, Pointer-Cache Assisted
Prefetching. To appear in the 35th Annual International Symposium on Microarchitecture,
November 2002. (pdf)

• Timothy Sherwood, Erez Perelman, Greg Hamerly and Brad Calder, Automatically
Characterizing Large Scale Program Behavior. In the 10th International Conference on
Architectural Support for Programming Languages and Operating Systems, October 2002. (pdf)

• Eric Tune, Dean Tullsen, and Brad Calder, Quantifying Instruction Criticality, in the Eleventh
International Conference on Parallel Architectures and Compilation Techniques, September 2002.
(pdf)

• Lori Carter and Brad Calder, Using Predicate Path Information in Hardware to Determine
True Dependences. In the proceedings of the 16th Annual ACM International Conference on
Supercomputing, June 2002. (pdf)

• Lori Carter, Weihaw Chuang, and Brad Calder, An EPIC Processor with Pending Functional
Units. In the proceedings of the 4th International Symposium on High Performance Computing
(ISHPC2K), May 2002, (c) Springer-Verlag. (pdf)

• Suleyman Sair, Timothy Sherwood and Brad Calder, Quantifying Load Stream Behavior, 8th
International Symposium on High-Performance Computer Architecture, February 2002. (pdf).

2001 

• Timothy Sherwood and Brad Calder, Patchable Instruction ROM Architecture, International
Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES),
November 2001. (pdf)

• Timothy Sherwood, Erez Perelman and Brad Calder, Basic Block Distribution Analysis to Find
Periodic Behavior and Simulation Points in Applications, International Conference on Parallel
Architectures and Compilation Techniques, September 2001. (pdf)

• Chandra Krintz and Brad Calder, Reducing Delay with Dynamic Selection of Compression
Formats, Tenth International Symposium on High Performance Distributed Computing, August
2001. (pdf)

• Timothy Sherwood and Brad Calder, Automated Design of Finite State Machine Predictors for
Customized Processors. 28th International Symposium on Computer Architecture, June 2001.
(pdf)

• Chandra Krintz and Brad Calder, Using Annotations to Reduce Dynamic Optimization Time,
ACM SIGPLAN 2001 Conference on Programming Language Design and Implementation
(PLDI), June 2001. (pdf)

• Glenn Reinman, Brad Calder, and Todd Austin, Optimizations Enabled by a Decoupled
Front-End Architecture. IEEE Transactions on Computers, Vol. 50, No. 4, April 2001. (pdf)

• Chandra Krintz, David Grove, Vivek Sarkar, and Brad Calder, Reducing the Overhead of
Dynamic Compilation. Software: Practice and Experience, pages 717-738, Volume 31, Issue 8,
March 2001. (pdf)

• Eric Tune, Dongning Liang, Dean M. Tullsen, and Brad Calder, Dynamic Prediction of the
Critical Dependence Path. 7th International Symposium On High Performance Computer
Architecture, January 2001. (pdf)

2000 

• Timothy Sherwood, Suleyman Sair, and Brad Calder, Predictor-Directed Stream Buffers. In
proceedings of the 33rd International Symposium on Microarchitecture, December 2000. (pdf)

• Lori Carter, Beth Simon, Brad Calder, Larry Carter, and Jeanne Ferrante, Path Analysis and
Renaming for Predicated Instruction Scheduling, International Journal of Parallel
Programming, pages 563-588, December 2000. (pdf)

• Timothy Sherwood and Brad Calder, Loop Termination Prediction. In the proceedings of the
3rd International Symposium on High Performance Computing (ISHPC2K), October 2000, (c)
Springer-Verlag. (pdf)

• Barbara Kreaseck, Dean Tullsen, and Brad Calder, Limits of Task-based Parallelism in
Irregular Applications. In the proceedings of the 3rd International Symposium on High
Performance Computing (ISHPC2K), October 2000, (c) Springer-Verlag. (pdf)

• Timothy Sherwood and Brad Calder, ToolBlocks: An Infrastructure for the Construction of
Memory Hierarchy Analysis Tools. In the proceedings of the International European

Abstract

In a single second a modern processor can execute billions
of instructions. Obtaining a bird's eye view of the behavior of a
program at these speeds can be a difficult task when all that is
available is cycle by cycle examination. In many programs,
behavior is anything but steady state, and understanding the
patterns of behavior, at run-time, can unlock a multitude of
optimization opportunities.

In this paper, we present a unified profiling architecture that
can efficiently capture, classify, and predict phase-based
program behavior on the largest of time scales. By examining the
proportion of instructions that were executed from different
sections of code, we can find generic phases that correspond to
changes in behavior across many metrics. By classifying phases
generically, we avoid the need to identify phases for each
optimization, and enable a unified prediction scheme that can
forecast future behavior. Our analysis shows that our design can
capture phases that account for over 80% of execution using less
than 500 bytes of on-chip memory.

1 Introduction 

Modern processors can execute upwards of 5 billion instructions
in a single second, yet most architectural features target program
behavior on a time scale of hundreds to thousands of
instructions, less than half a µs. While these optimizations can provide
large benefits, they are limited in their ability to see the program
behavior in a larger context.

Recently there has been a renewed interest in examining
the run-time behavior of programs over longer periods of
time [10, 11, 19, 20, 3]. It has been shown that programs can
have considerably different behavior depending on which
portion of execution is examined. More specifically, it has been
shown that many programs execute as a series of phases, where
each phase may be very different from the others, while still
having a fairly homogeneous behavior within a phase. Taking
advantage of this time varying behavior can lead to, among other
things, improved power management, cache control, and more
efficient simulation. The primary goal of this research is the
development of a unified run-time phase detection and prediction
mechanism that can be used to guide any optimization seeking
to exploit large scale program behavior.

A phase of program behavior can be defined in several ways. 
Past definitions are built around the idea of a phase being an in- 
terval of execution during which a measured program metric is 
relatively stable. We extend this notion of a phase to include all 
similar sections of execution regardless of temporal adjacency. 
Simply put, if a phase of execution is correctly identified, there
should only be small variations between any two execution in-
tervals identified as being part of the same phase. A key point of 
this paper is that the phase behavior seen in any program metric 
is directly a function of the way the code is being executed. If 
we can accurately capture this behavior at run-time through the 
computation of a single metric, we can use this to guide many 
optimization and policy decisions without duplicating phase de- 
tection mechanisms for each optimization. 

In this paper, we present an efficient run-time phase tracking 
architecture that is based on detecting changes in the propor- 
tions of the code being executed. In addition, we present a novel 
phase prediction architecture that can predict not only when a
phase change is about to occur, but also the phase to which it
will transition. Since our phase tracking implementation is
based upon code execution frequencies, it is independent of any 
individual architecture metric. This allows our phase tracker to 
be used as a general profiling technique building up a profile or 
database of architecture information on a per phase basis to be 
used later for hardware or software optimization. Independence 
from architecture metrics allows us to consistently track phase 
information as the program's behavior changes due to phase- 
based optimizations. 

We demonstrate the effectiveness of our hardware based 
phase detection and classification architecture at automatically 
partitioning the behavior of the program into homogeneous 
phases of execution and at identifying phase changes. We show
that the changes in many important metrics, such as IPC and en- 
ergy, correlate very closely with the phase changes found by our 
metric. We then evaluate the effectiveness of phase tracking and 
prediction for value profiling, data cache reconfiguration, and 
re-configuring the width of the processor. 

The rest of the paper is laid out as follows. In Section 2, 
prior work related to phase-based program behavior is discussed. 
Simulation methodology and benchmark descriptions can be 
found in Section 3. Section 4 describes our phase tracking ar- 
chitecture. The design and evaluation of the phase predictor are 
found in Section 5. Section 6 presents several potential applica- 
tions of our phase tracking architecture. Finally, the results are 
summarized in Section 7. 

2 Related Work 

In this Section we describe work related to phase identification 
and phase-based optimization. 

In [19], we provided an initial study into the time varying 
behavior of programs, showing that programs have repeatable 
phase-based behavior over many hardware metrics — cache be- 
havior, branch prediction, value prediction, address prediction, 
IPC and RUU occupancy for all the SPEC 95 programs. Looking
at these metrics over time, we found that many programs have 
repeating patterns, and that important metrics tend to change at 
the same time. These places represent phase boundaries. 

In [20], we proposed that by profiling only the code that was 
executed over time we could automatically identify periodic and 
phase behavior in programs. The goal was to automatically find 
the repeating patterns observed in [19], and the lengths (peri- 
ods) of these patterns. We then extended this work in [21], using 
techniques from machine learning to break the complete exe- 
cution of the program into phases (clusters) by only tracking the 
code executed. We found that intervals of execution grouped into 
the same phase had similar behavior across all the architecture 
metrics examined. From this analysis, we created a tool called 
SimPoint [21], which automatically identifies a small set of in- 
tervals of execution (simulation points) in a program to perform 
architecture simulations. These simulation points provide an ac- 
curate and efficient representation of the complete execution of 
the program. 

The work of Dhodapkar and Smith [10, 9] is the most closely 
related to ours. They found a relationship between phases and 
instruction working sets, and that phase changes occur when the 
working set changes. They propose that by detecting phases and 
phase changes, multi-configuration units can be re-configured in 
response to these phase changes. They have used their working 
set analysis for instruction cache, data cache and branch predic-
tor re-configuration to save energy [10, 9].

The work we present in this paper identifies phases and phase 
changes by keeping track of the proportions in which the code 
was executed during an interval based upon the profiler used 
in [20]. In comparison, Dhodapkar and Smith [10, 9] track the 
phase and phase changes solely upon what code was executed 
(working set), without weighting the code by its frequency of 
execution. Future research is needed to compare these two ap- 
proaches. 

Additional differences between our work and theirs include our
examination of architectures for predicting phase changes, and differ-
ent uses from [10, 9], such as value profiling and processor width
reconfiguration. We provide an architecture that can fairly accu- 
rately predict what the next phase will be, along with predicting 
when there will be a phase change. In comparison, Dhodapkar 
and Smith do not examine phase-based prediction [10, 9], but 
concentrate on detecting when the working set size changes, and 
then reactively apply optimization. 

Merten et al. [15] developed a run-time system for dynami- 
cally optimizing frequently executed code. Then in [3], Barnes 
et al. extend this idea to perform phase-directed compiler op-
timizations. The main idea is the creation of optimized code
"packages" that are targeted towards a given phase, with the goal 
of execution staying within the package for that phase. Barnes et 
al. concentrate primarily on the compiler techniques needed to 
make phase-directed compiler optimizations a reality, and do not 
examine the mechanics of hardware phase detection and classi- 
fication. We believe that using the techniques in [3] in conjunc- 
tion with our phase classification and prediction architecture will 
provide a powerful run-time execution environment. 

3 Methodology

To perform our study, we collected information for ten
SPEC 2000 programs: applu, apsi, art, bzip, facerec,
galgel, gcc, gzip, mcf, and vpr, all with reference inputs.
All programs were executed from start to completion using Sim- 
pleScalar [5] and Wattch [4]. Because of the lengthy simulation 
time incurred by executing all of the programs to completion, 
we chose to focus on only 10 programs. We chose the above 
10 programs since their phase based behavior represents a rea- 
sonable snapshot of the SPEC 2000 benchmark suite, along with 
picking some of the programs that showed the most interesting 
phase-based behavior. Each program was compiled on a DEC
Alpha AXP-21164 processor using the DEC C and FORTRAN
compilers. The programs were built under the OSF/1 V4.0 operating
system using full compiler optimization (-O4 -ifo).

The timing simulator used was derived from the Sim- 
pleScalar 3.0 tool set [5], a suite of functional and timing simu- 
lation tools for the Alpha AXP ISA. The baseline microarchitec-
ture model is detailed in Table 1. In addition to this, we wanted
to examine energy usage optimizations, so we used a version of 
Wattch [4] to capture this information. We modified all of these 
tools to log and reset the statistics every 10 million instructions, 
and we use this as a base for evaluation. 

4 Phase Capture 

In this section we motivate the occurrence of phase-based behav- 
ior, describe our architecture for capturing it, and examine the 
accuracy of using the program behavior in our phase-tracking 
architecture to identify phase changes for various hardware met- 
rics. 

4.1 Phase-Based Behavior 

The goal of this research is to design an efficient and general pur- 
pose technique for capturing and predicting the run-time phase 
behavior of programs for the purpose of guiding any optimiza- 
tion seeking to exploit large scale program behavior. Figure 1 
helps to motivate our approach to the problem. This figure shows 
the behavior of two programs, gcc and gzip, as measured by
various different statistics over the course of their execution from 
start to finish. Each point on the graph is taken over 10 mil- 
lion instructions worth of execution. The metrics shown are the 

Figure 1: To illustrate the point that phase changes happen across many metrics all at the same time, we have plotted the value
of these metrics over billions of instructions executed for the programs gcc (shown left) and gzip (shown right). Each point on
the graph is an average over 10 million instructions. The number of unified L2 cache misses (ul2), the energy consumed by the
execution of the instructions, the number of instruction cache (il1) misses, the number of data cache misses (dl1), the number of
branch mispredictions (bpred) and the average IPC are plotted.



number of unified L2 cache misses (ul2), the energy consumed
by the execution of the instructions, the number of instruction
cache (il1) misses, the number of data cache misses (dl1), the
number of branch mispredictions (bpred) and the average IPC.
The results show that all of the metrics tend to change in unison, 
although not necessarily in the same direction. In addition to 
this, patterns of recurring behavior can be seen over very large 
time scales. 

As can be seen from these graphs, even at a granularity of 10 
million instructions (which is at the same time scale as operating 
system time slices) there can be wildly different behavior seen 
between intervals. In this paper, we concentrate on a granularity 
of 10 million instructions because it is both outside the scope 
of normal architectural timing and is small enough to allow for 
many complex phase behaviors to be seen. 

4.2 Tracking Phases by Executed Code

Our phase tracker architecture operates at two different time 
scales. It gathers profile information very quickly in order to 
keep up with processor speeds, while at the same time it com- 
pares any data it gathers with information collected over the long 
term. Additionally, it must be able to do all that while still being 
reasonable in size. 

Our phase profile generation architecture can be seen in Fig- 
ure 2. The key idea is to capture basic block information during 
execution, while not relying on any compiler support. Larger 
basic blocks need to be weighted more heavily as they account
for a more significant portion of the execution. To approximate 
gathering basic block information, we capture branch PCs and 
the number of instructions executed between branches. The in- 
put to the architecture is a tuple of information: a branch iden- 
tifier (PC) and the number of instructions since the last branch
PC was executed. This allows us to roughly capture each basic
block executed along with the weight of the basic block in terms 
of the number of instructions executed, as we did in [20, 21] for 
identifying simulation points. 

Classifying phases by examining only the code that is ex- 
ecuted allows our phase tracker to be independent of any in- 
dividual architecture metric. This allows our phase tracker to 
be used as a general profiling technique building up a profile or 
database of architecture information on a per phase basis to be 
used later for hardware or software optimization. Independence 
from architecture metrics is also very important to allow us to 
consistently track phase information as the program's behavior 
changes due to phase-based optimizations. 

At this point it is worth making more explicit the differences 
between our technique and that of Dhodapkar and Smith [10, 9].
Dhodapkar and Smith use a bit vector to track the working set of
the code for a particular interval, while our technique is based
on the basic block vectors used in [20]. The bit vectors of Dho-
dapkar and Smith track a metric that is related to which code
blocks were touched, whereas our metric tracks the proportion 
of time spent executing in each code block. This is a subtle but 
important distinction. We have found that in complex programs 
(such as gcc and gzip) there are many instruction blocks that
execute only intermittently. When tracking the pure working set, 
these infrequently executed blocks can disguise the frequently 
executed blocks that dominate the behavior of the application. 
On the other hand, by tracking the frequency of code execution 
it is possible to distinguish important instructions (basic blocks) 
from a sea of infrequently executed ones. Examining these dif- 
ferences in more detail is a topic of future research. 

Another advantage of tracking the proportions in which the 
basic blocks are executed is that we can use this information to 

Figure 2: Our phase classification architecture. Each branch PC
is captured along with the number of instructions from the last 
branch. The bucket entry corresponding to a hash of the branch 
PC is incremented by the number of instructions. After each 
profiling interval has completed, this information is classified, 
and if it is found to be unique enough, stored in the past footprint 
table along with its phase ID. 

identify not only when different sections of code are executing, 
but also when those sections of code are being exercised differ- 
ently. A simple example is in a graphics manipulation program 
running a parameterized filter on an input image. If you run a 
simple 3x3 blur filter on an image you get very different behavior 
than if you run a 7x7 filter on the same image despite the fact that 
the same filter code is executing. The 7x7 filter will have many 
more memory references and those memory references conflict 
very differently in the cache than in the 3x3 case. We have seen 
this very behavior in examining the interactive graphics program 
xv. Using the proportion of execution for each basic block can 
distinguish these differences, because in the 3x3 filter the head 
of the loop is called more than twice as frequently as in the 7x7 
filter. 

The same general idea applies to other data structures as 
well. Take for example a linked list. As the number of nodes in 
the linked list traversal changes over different loop invocations, 
the number of instructions executed inside the loop versus the 
time spent outside the loop also changes. This behavior can be 
captured when including a measure of the proportion of the code 
executed, and this can distinguish between linked list traversals of
different lengths. 

4.3 Capturing the Code Profile 

To index into the accumulator table in Figure 2, the branch PC 
is reduced to a number from 1 to Nbuckets using a hash func- 
tion. We have found that 32 buckets is sufficient to distinguish 
between different phases even for some of the more complex 
programs such as gcc. A counter is kept for each bucket, and 
the counter is incremented by the number of instructions from 
the last branch to the current branch being processed. Each ac- 
cumulator table entry is a large (in this study 24-bit), saturating 
counter, which will not saturate during our profiling interval of 
10 million instructions. Updating the accumulator table is the 
only operation that needs to be performed at a rate equivalent to
the processor's execution of the program (once for every branch
executed). In comparison, the phase classification described be- 
low needs to only be performed once every 10 million instruc- 
tions (at the end of each interval), and thus is not nearly as per- 
formance critical. 
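
The accumulator update described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hash in `bucket_index` (drop the low two bits of the PC, then take the remainder) is hypothetical, since the text does not specify the hash function, and buckets are numbered from 0 rather than from 1.

```python
N_BUCKETS = 32               # number of accumulator table entries, as in the text
COUNTER_MAX = (1 << 24) - 1  # largest value of a 24-bit saturating counter


def bucket_index(branch_pc):
    """Hypothetical hash: assumes 4-byte instructions, so the low 2 PC bits carry no information."""
    return (branch_pc >> 2) % N_BUCKETS


def update_accumulator(table, branch_pc, instr_count):
    """Add the instructions executed since the last branch to this branch's bucket."""
    i = bucket_index(branch_pc)
    table[i] = min(table[i] + instr_count, COUNTER_MAX)  # saturate instead of wrapping


# One (branch PC, instruction count) tuple arrives per executed branch.
table = [0] * N_BUCKETS
for pc, count in [(0x120001000, 7), (0x120001040, 12), (0x120001000, 7)]:
    update_accumulator(table, pc, count)
```

Only `update_accumulator` would have to run at branch rate; everything else in the classifier works on the table once per interval.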

We note that the hashing function we use is fundamentally 
the same as the random projection method we used to generate 
phases in [21]. In this prior work, we make use of random pro- 
jections of the data to reduce the dimensionality of the samples 
being taken. A random projection takes trace data in the form of 
a matrix of size L x B, where L is the length of the trace and B is
the number of unique basic blocks, and multiplies it by a random
matrix of size B x N, where N is the desired dimensionality of
the data which is much smaller than B. This creates a new ma-
trix of size L x N, which has clustering properties very similar
to the original data. The random projection method is a powerful 
technique when used with clustering algorithms, and for captur- 
ing phase behavior as we showed in [21]. The hashing scheme 
we use in this paper is essentially a degenerate form of random 
projection that makes a hardware implementation feasible while 
still having low error. If all the elements of the random projec-
tion matrix consist of either a 0 or a 1, and they are placed such
that no column of the matrix contains more than a single 1, then 
the random projection is identical to this simple hashing mech- 
anism. We have designed our phase classification architecture 
around this principle. 
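
The equivalence between hashing and a degenerate random projection can be checked on a toy example. The sketch below assumes a B x N matrix in which each basic block contributes a single 1, placed at the bucket chosen by a hypothetical hash (`block_id % n`); the orientation of the matrix and the hash itself are illustration choices, not details taken from the text.

```python
def hash_bucket(block_id, n):
    """Hypothetical hash from basic block number to bucket number."""
    return block_id % n


def projection_matrix(b, n):
    """B x N matrix of 0s and 1s with exactly one 1 per basic block, at that block's bucket."""
    return [[1 if hash_bucket(block, n) == col else 0 for col in range(n)]
            for block in range(b)]


def project(counts, matrix):
    """Multiply a length-B row vector of per-block counts by the B x N matrix."""
    n = len(matrix[0])
    return [sum(counts[blk] * matrix[blk][col] for blk in range(len(counts)))
            for col in range(n)]


def hash_accumulate(counts, n):
    """Accumulate per-block counts directly into hashed buckets."""
    buckets = [0] * n
    for block, c in enumerate(counts):
        buckets[hash_bucket(block, n)] += c
    return buckets


counts = [5, 0, 3, 9, 1, 4, 0, 2]  # made-up instruction counts per basic block
projected = project(counts, projection_matrix(len(counts), 4))
hashed = hash_accumulate(counts, 4)
```

Both routes produce the same bucket vector, which is why the hardware can skip the matrix multiply entirely.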

Figure 3 shows the effect of applying the above mentioned 
technique for capturing the phase behavior of the integer bench- 
mark gzip. The x-axis of the figure is in billions of instructions, 
as is the case in Figure 1. Each point on the y-axis represents an 
entry of the phase tracker's accumulator table. Each point on the 
graph corresponds to the value of the corresponding accumulator 
table entry at the end of a profiling interval. Dark values repre- 
sent high execution frequency, while light values correspond to 
low frequency. The same trends that were seen in Figure 1 for 
gzip can be clearly seen in Figure 3. In both of these figures, 
when observing them at the coarsest granularity, we can see that 
there are at least three different phases labeled A, B and C. In 
Figure 3, the phase tracker table entries 2, 5, 7, 13 and
17 distinguish the two identical long running phases labeled A
from a group of three long running phases labeled C. Phase table
entries 12 and 20 clearly distinguish phase B from both A and
C. This figure is pictorial evidence that the phase tracker is able 
to break the program's execution into the corresponding phases 
based solely on the executed code, and that these phases corre- 
spond to the behavior seen across the different program metrics 
in Figure 1. 

4.4 Forming a Footprint 

After the profiling interval has elapsed, and branch block infor- 
mation has been accumulated, the phase must then be classified. 
To do this we keep a history of past phase information. 

If we fix the number of instructions for a profiling interval, 
then we can divide each bucket by this fixed number to get the 
percentage of execution that was accounted for by all instruc-
tions mapped to that bucket. However, we do not need to know
the exact percentages for each bucket. Instead of keeping the 

Figure 3: Visualization of the accumulator table used to track
program behavior for gzip. The x-axis is in billions of instruc-
tions, while the y-axis is the entry of the accumulator table. Each
point on the graph corresponds to the value of the accumulator
table at the end of a profiling interval where dark values corre-
spond to more heavily accessed entries. The same trends that
were seen in Figure 1 can be clearly seen in Figure 3.

full counter values, we can instead compress phase information 
down to a couple of the most significant bits. This compressed 
information will then be kept in the Past Footprint table as shown 
in Figure 2. 

The number of counter value bits that we need to observe is 
related to Nbuckets. As we increase the number of buckets, the 
data is spread over more buckets (table entries), making for fewer
entries per bucket (better resolution) but at the cost of more area
(both in terms of number of buckets and more bits per bucket). 
To be on the safe side, we would like any distribution of data into 
buckets to provide useful information. To achieve this we need 
to ensure that even if data is distributed perfectly evenly over 
all of the buckets, we would still record information about the 
frequency of those buckets. This can be achieved by reducing 
the accumulator counter by: 

(bucket[i] x Nbuckets) / (interval size)

If the number of buckets and interval size are powers of two, 
this is a simple shift operation. For the number of buckets we 
have chosen (32), and the interval size we profile over, this re- 
duces the bucket size to 6 bits, and thus requires 24 bytes of stor- 
age for each unique phase in the Past Footprint table. In practice 
we see that the top 6 bits of the counter are more than enough 
to distinguish between two phases. In the worst case, you may 
need one or two more bits to reduce quantization error, but in 
reality we have not seen any programs that cause this to be an 
issue. 
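
As a rough worked example of the reduction above, the sketch below keeps the top 6 bits of each 24-bit counter. The shift amount of 18 is an assumption derived from "the top 6 bits of the counter"; the text does not give the exact shift used in hardware.

```python
N_BUCKETS = 32
COUNTER_BITS = 24
FOOTPRINT_BITS = 6
SHIFT = COUNTER_BITS - FOOTPRINT_BITS  # assumed: keep the top 6 bits of a 24-bit counter
INTERVAL = 10_000_000                  # instructions per profiling interval


def compress(table):
    """Reduce each accumulator counter to its most significant bits."""
    return [c >> SHIFT for c in table]


# Even in the perfectly even case every bucket still records a nonzero value.
even_table = [INTERVAL // N_BUCKETS] * N_BUCKETS
even_footprint = compress(even_table)

# The extreme opposite case: every instruction lands in one bucket.
skewed_footprint = compress([INTERVAL] + [0] * (N_BUCKETS - 1))

# Storage for one footprint: 32 buckets x 6 bits = 192 bits = 24 bytes.
storage_bytes = N_BUCKETS * FOOTPRINT_BITS // 8
```

Under these assumptions the evenly spread interval compresses to a footprint of all 1s, and the fully skewed interval still fits in 6 bits.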

If too few buckets are used, aliasing effects can occur due 
to the hashing function, where two different phases will appear 
to have very similar Footprints. Therefore, we want to use a 
sufficiently large number of buckets to uniquely identify the dif- 
ferences in code execution between phases, while at the same 
time use only a small amount of area. 

To examine the aliasing effect and determine what the appro- 
priate number of buckets should be, Figure 4 shows the sum of 
the differences in the bucket weights found between all sequen- 
tial intervals of execution. The y-axis shows the sum total of 
differences for each program. This is calculated by summing the 

Figure 4: The percent difference found between Footprints from
sequential intervals of execution, when varying the number of
counters used to represent the footprints. The results are nor-
malized to the difference between intervals found when having
an infinite number of buckets to represent the footprint; 32 buck-
ets captures most of the benefit.

differences between the buckets captured for interval i and i - 1
for each interval i in the program. The x-axis is the number of 
distinct buckets used. All of the results are compared to the ideal 
case of using an infinite number of buckets (or one for each sep-
arate basic block) to create the Footprint. On the program gcc,
for example, the total sum of differences with 32 buckets was
72% of that captured with an infinite number of buckets. In gen-
eral we have found that 32 buckets were enough to distinguish
between two phases.
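
The aliasing measurement can be illustrated with a small sketch. The interval data and the hash in `fold` (basic block k goes to bucket k % n) are made up for the example; only the metric itself, the summed difference between sequential intervals normalized to the one-bucket-per-block case, follows the description above.

```python
def sequential_difference(profiles):
    """Sum, over each interval i >= 1, of the element-wise absolute difference from interval i - 1."""
    total = 0
    for prev, cur in zip(profiles, profiles[1:]):
        total += sum(abs(a - b) for a, b in zip(prev, cur))
    return total


def fold(profile, n_buckets):
    """Hypothetical hash: basic block k is accumulated into bucket k % n_buckets."""
    buckets = [0] * n_buckets
    for block, weight in enumerate(profile):
        buckets[block % n_buckets] += weight
    return buckets


# Per-basic-block weights for three made-up intervals (the "infinite buckets" case).
full = [[8, 0, 2, 0], [2, 0, 8, 0], [0, 6, 0, 4]]
ideal = sequential_difference(full)
with_two_buckets = sequential_difference([fold(p, 2) for p in full])
captured = with_two_buckets / ideal
```

In this toy data the first two intervals differ only between blocks that alias to the same bucket, so two buckets capture just part of the ideal difference.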

4.5 Classifying a Footprint to a Phase ID

After reducing the vector to form a footprint, we begin the clas- 
sification process by comparing the footprint to a set of repre- 
sentative past footprint vectors. We compare the current vector 
to each vector in the table. The next section details how we per- 
form the comparison and determine what a match is. If there is 
a match, we classify the profiled section of execution into the 
same phase as the past footprint vector, and the current vector 
is not inserted into the past footprint table. If there is no match, 
then we have just detected a new phase and hence must create a 
new unique phase ID into which we may classify it. This is done 
by choosing a unique phase ID out of a fixed pool of IDs. When 
allocating a new phase ID, we also allocate a new past footprint 
entry, set it to the current vector, and store with that entry the 
newly allocated phase ID. This allows future similar phases to
be classified with the same ID. In this way only a single vector 
is kept for each unique phase ID, to serve as a representative of 
that phase. After a phase ID is provided for the most recent in-
terval, it is passed along to prediction and statistic logging, and
the phase identification part of our algorithm is completed. 
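
The classification step can be sketched as a small table of representative footprints. The class name `PastFootprintTable`, the pluggable `is_match` predicate, and the error raised when the ID pool runs out are all hypothetical; this section does not say what happens once every phase ID has been allocated.

```python
class PastFootprintTable:
    """Holds one representative footprint per phase ID (illustrative sketch only)."""

    def __init__(self, max_phases, is_match):
        self.max_phases = max_phases  # size of the fixed pool of phase IDs
        self.is_match = is_match      # comparison function; see the matching discussion
        self.entries = []             # list of (footprint, phase_id) pairs

    def classify(self, footprint):
        """Return the phase ID for this interval, allocating a new one if nothing matches."""
        for past, phase_id in self.entries:
            if self.is_match(past, footprint):
                return phase_id       # known phase: the table is left unchanged
        if len(self.entries) >= self.max_phases:
            # Placeholder: a replacement policy would go here, but none is described.
            raise RuntimeError("phase ID pool exhausted")
        phase_id = len(self.entries)
        self.entries.append((list(footprint), phase_id))
        return phase_id


# Exact equality stands in for the real comparison purely to keep the example short.
table = PastFootprintTable(max_phases=4, is_match=lambda a, b: a == b)
ids = [table.classify(f) for f in ([1, 2], [3, 4], [1, 2])]
```

The third interval reuses the first interval's phase ID, and only two footprints end up stored.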

To examine the number of phase IDs we need to track, Fig- 
ure 5 shows the percentage of execution that can be accounted 
for by the top p phases, where p is shown on the x-axis. Re- 
sults are graphed for the programs that had the min (galgel) 
and max (art) coverage, gcc, gzip, and the overall average.
These results show that most of the program's phase behavior 
can be captured using a relatively small number of phase IDs. 



Figure 5: Results of the minimum number of phases that need
to be captured versus the amount of program execution they cover.
The y-axis is the percent of program execution that is covered.
The x-axis (Number of Hardware Detected Phases) is the minimum
number of phases needed to capture that much program execution.

If we only track and optimize for the top 20 phases in each ap- 
plication, we will capture and be able to accurately apply phase 
prediction/optimizations to over 90% of the program's execution 
on average. In the worst case (min), we are able to optimize most 
of the program (over 80%) by only targeting a small number (20)
of important recurring phases. 
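
The coverage figure for the top p phases can be computed from a sequence of per-interval phase IDs. In the sketch below the ID sequence is made up, and the helper name `coverage` is hypothetical; because every interval has the same length, the fraction of intervals equals the fraction of executed instructions.

```python
from collections import Counter


def coverage(phase_ids, p):
    """Fraction of intervals that belong to the p most frequent phase IDs."""
    counts = Counter(phase_ids)
    covered = sum(n for _, n in counts.most_common(p))
    return covered / len(phase_ids)


# Made-up classification of ten intervals into four phases.
ids = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3]
top1 = coverage(ids, 1)
top2 = coverage(ids, 2)
top4 = coverage(ids, 4)
```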

4.5.1 Finding a Match

We search through the Footprint histories to find a match, but 
this query is complicated by the fact that we are not necessar- 
ily searching for an exact match. Two sections of execution that 
have very similar footprints could easily be considered a match, 
even if they do not compare exactly. To compare two vectors 
to one another, we use the Manhattan distance between the two, 
which is the element-wise sum of the absolute differences. This 
distance is used to determine if the current interval should be 
classified as the same phase ID as one of the past footprint inter- 
vals. 
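
A minimal sketch of the comparison follows. The `manhattan` function is the standard definition; whether a distance exactly equal to the threshold counts as a match is not stated in the text, so the strict inequality in `is_match` is an assumption.

```python
def manhattan(a, b):
    """Sum of the element-wise absolute differences between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("footprints must have the same number of buckets")
    return sum(abs(x - y) for x, y in zip(a, b))


def is_match(a, b, threshold):
    """Assumed rule: two footprints match when their distance is below the threshold."""
    return manhattan(a, b) < threshold


distance = manhattan([3, 0, 5], [1, 4, 5])  # |3-1| + |0-4| + |5-5|
```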

If we set the distance threshold too low, the phase detection 
will be overly sensitive, and we will classify the program into 
many, very tiny phases which will cause us to lose any bene- 
fit from doing run-time phase analysis in the first place. If the 
threshold is too high, the classifier will not be able to distinguish 
between phases with different behavior. To quantify this effect, 
we examine how well our hardware technique classifies phases
for a variety of thresholds compared to the phases found by the 
off-line clustering algorithm used in SimPoint [21]. 

The SimPoint tool is able to make global decisions to opti- 
mize the grouping of similar intervals into phases. The off-line 
algorithm makes no use of thresholds; instead its decisions are
based solely on the structure found in the distribution of pro- 
gram behaviors. Our technique must be far more simplistic be- 
cause it must be performed on-line and with limited computa- 
tional overhead. This reduction in complexity comes at the cost 
of increased error. 

The Different Phases line in Figure 6 shows the ability of 
our hardware technique to find phase changes (transitions be- 
tween one phase and the next) when different thresholds are used 

Figure 6: Results showing how well our hardware phase tracker
classifies two sequential intervals of execution as being from
"Different" or the "Same" phase of execution. The percent of
misclassifications are shown in comparison to the phase classi-
fications found using the off-line clustering SimPoint tool [21].

to perform the phase classification. For example, when using a 
Manhattan distance of 1 million as our threshold (shown as 20 
on our x-axis because it is in log2), our hardware technique iden-
tified 80% of the phase changes that occurred in the more com-
plex off-line SimPoint analysis. Conversely, 20% of the phase 
changes were incorrectly classified as having the same phase ID 
as the last interval of execution. 

Likewise, the Same Phases line in Figure 6 represents the 
ability of our hardware technique to accurately classify two se- 
quential intervals as being part of the same phase as a function 
of different thresholds (again as compared to the off-line cluster- 
ing analysis). For example, when using a Manhattan distance of 
1 million (shown as 20 on the x-axis), our hardware technique 
identified 80% of the intervals that stayed in the same phase 
as correctly staying in the same phase, but 20% of those inter- 
vals were classified as having a different phase ID from the prior 
phase. 

A misclassification occurs when two sequential intervals of 
execution are classified as being in the same phase or in different 
phases using our hardware approach when the off-line clustering 
analysis tool found the opposite for these two intervals. 
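
The misclassification definition above translates directly into a comparison over pairs of sequential intervals. The sketch below assumes two per-interval ID sequences, one from the hardware tracker and one from the off-line reference; the sample data and the function name are made up. Only the same/different relation between neighbours is compared, so the two sequences do not need to use the same ID values.

```python
def misclassification_rates(hw_ids, ref_ids):
    """Return (same_phase_error, different_phase_error) relative to the reference."""
    same_total = same_wrong = diff_total = diff_wrong = 0
    for i in range(1, len(ref_ids)):
        ref_same = ref_ids[i] == ref_ids[i - 1]
        hw_same = hw_ids[i] == hw_ids[i - 1]
        if ref_same:
            same_total += 1
            same_wrong += not hw_same   # hardware saw a change that the reference did not
        else:
            diff_total += 1
            diff_wrong += hw_same       # hardware missed a change that the reference saw
    same_rate = same_wrong / same_total if same_total else 0.0
    diff_rate = diff_wrong / diff_total if diff_total else 0.0
    return same_rate, diff_rate


ref = [0, 0, 0, 1, 1, 0]  # made-up off-line classification
hw = [5, 5, 6, 7, 7, 7]   # made-up hardware classification
same_rate, diff_rate = misclassification_rates(hw, ref)
```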

If we are too aggressive and our hardware phase analysis in- 
dicates that there are phase changes when there are actually no 
noticeable changes in behavior, then we will create too many 
phase IDs that have similar behavior. This can create more over- 
head for performing phase-based optimization. On the other 
hand, if we are too passive in distinguishing between different 
phases, we will be missing opportunities to make phase specific 
optimizations. 

In order to strike a balance between having a high capture 
rate and reducing the percent of false positives, we chose to use 
a threshold of 1 million. When comparing this with the interval 
size of 10 million instructions, this means that a difference in the 
phase behavior will be detected if 10% of the executed instruc- 
tions are in different proportions. In choosing 1 million, we have 
on average a 20% misclassification rate. Note that a misclassi-
fication does not necessarily mean that an incorrect optimization
will be performed. For example, if we have a "Same Phase" mis-
classification (the two intervals were really from the same phase,
but were classified into different phases), then a phase change is 
observed using our hardware technique when there was not one 
in the baseline classifier. If the two hardware detected phases 
have the same optimization applied to them, then this misclassi- 
fication can have no effect. 
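
The arithmetic behind the chosen threshold can be checked quickly. The variable names below are illustrative only.

```python
import math

THRESHOLD = 1_000_000   # Manhattan distance threshold chosen in the text
INTERVAL = 10_000_000   # instructions per profiling interval

fraction = THRESHOLD / INTERVAL         # share of the interval that must shift
log2_threshold = math.log2(THRESHOLD)   # position of the threshold on a log2 axis
```

Here `fraction` works out to 10% of the interval, and `log2_threshold` is just under 20, which matches the axis label used for Figure 6.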

4.6 Per-Phase Performance Metric Homogeneity

Using the techniques presented above, we can perform phase 
classification on programs at run-time with little to no impact 
on the design of the processor core. One of the goals of phase 
classification is to divide the program into a set of phases that are 
fairly homogeneous. This means that an optimization adapted 
and applied to a single segment of execution from one phase, 
will apply equally well to the other parts of the phase. In order to 
quantify the extent to which we have achieved this goal, we need 
to test the homogeneity of a variety of architectural statistics on 
a per-phase basis. 

Figure 7 shows the results of performing this analysis on the 
phases determined at run-time. Due to space constraints we only 
show results for two of the more complicated programs gcc and 
gzip. For both programs, a set of statistics for each phase is 
shown. The first phase that is listed (separated from the rest) as 
full, is the result of classifying the entire program into a single 
phase. The results show that for gcc for example, the average 
IPC of the entire program was 1.32, while the average number 
of cache misses was 445,083 per ten million instructions. In 
addition to just the average value, we also show the standard 
deviation for that statistic. For example, while the average IPC 
was 1.32 for gcc, it varied with a standard deviation of over 
43% from interval to interval. If the phase-tracking hardware is 
successful in classifying the phases, the standard deviations for 
the various metrics should be low for a given phase ID. 
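
The homogeneity check can be sketched as a per-phase mean and a standard deviation expressed as a percentage of that mean. The sample IPC values are made up, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say which estimator was used. The metric is assumed to have a nonzero mean in every phase.

```python
from collections import defaultdict
from statistics import mean, pstdev


def relative_std(values):
    """Standard deviation as a percentage of the mean (assumes a nonzero mean)."""
    return 100.0 * pstdev(values) / mean(values)


def per_phase_stats(phase_ids, values):
    """Map each phase ID to (mean, standard deviation as a percent of the mean)."""
    groups = defaultdict(list)
    for pid, v in zip(phase_ids, values):
        groups[pid].append(v)
    return {pid: (mean(vs), relative_std(vs)) for pid, vs in groups.items()}


# Made-up IPC samples for four intervals that fall into two phases.
phase_ids = [0, 0, 1, 1]
ipc = [0.60, 0.62, 2.0, 2.0]
full_std = relative_std(ipc)             # whole program treated as one phase
stats = per_phase_stats(phase_ids, ipc)  # each phase on its own
```

In this toy data the whole-program deviation is large while each phase on its own is nearly constant, which is the pattern the text looks for.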

Underneath the phase marked full are the five most fre- 
quently executed phases from the program as identified by our 
phase tracker. The phases are weighted by the percentage of the 
program's executed instructions they account for. For gcc, the 
largest phase accounts for 18.5% of the instructions in the entire 
program and has an average IPC of 0.61 and a standard devi- 
ation of only 1.6% (of 0.61). The other top four phases have 
standard deviations at or below this level, which means that our 
technique was successful at dividing up the execution of gcc 
into large phases with similar execution behavior with respect to 
IPC. Note that some metrics for certain phases have a high standard
deviation, but this occurs for architecture features/metrics
that are unimportant for that phase. For example, the phase that
occurs for 7.2% of execution in gcc has only 75 L1 instruction
cache misses on average. This is an L1 miss rate of 0.00075%,
so an error of 215% for this metric will not likely have any effect 
on the phase. 

When we look at the energy consumption of gcc, it can be 
observed that energy consumption swings radically (a standard 
deviation of 90%) over the complete execution of the program. 
This can be seen visually in Figure 1, which plots the energy 
usage versus instructions executed. However, after dividing the 
program into phases, we see that each phase has very little variation
within itself. All have less than 2% standard deviation.
By analyzing gcc it can also be seen that the phase partitioning 
does a very good job across all of the measured statistics even 
though only one metric is used. This indicates that the phases 
that we have chosen are in some way representative of the actual 
behavior of the program. 

5 Phase Prediction 

The prior section described our phase tracking architecture, and 
how it can be used to classify phases. In this section we focus on 
using phase information to predict the next phase. For a variety 
of applications it is important to be able to predict future phase 
changes so that the system can configure for the code it will soon 
be executing rather than simply reacting to a change in behavior. 

Figure 8 shows the percentage of interval transitions that are
changes in phase, for our set of benchmarks. For all of these pro- 
grams, phase changes come quite often, but it should be noted 
that this statistic alone cannot gauge the complexity of the pro- 
gram behavior. The program gcc switches less than 10% of 
the time but switches between many different phases. The other 
extreme is art which switches almost half the time, but it is 
only switching between a few distinct phases. In this case, large 
repeating patterns can be observed. No two phases executing se- 
quentially are that similar, but there is an order to the sequence. 
By adding in a prediction scheme for these cases, we not only 
take advantage of stable conditions as in past research, but actu- 
ally take advantage of any repeating patterns in program behav- 
ior. 
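
The statistic discussed above can be sketched in a few lines of Python. The function name phase_change_rate and both traces are hypothetical; the traces only mimic the contrast drawn in the text between a program that switches rarely among many phases and one that switches often among few.

```python
def phase_change_rate(phase_ids):
    """Fraction of interval-to-interval transitions that change phase."""
    if len(phase_ids) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(phase_ids, phase_ids[1:])
                  if prev != cur)
    return changes / (len(phase_ids) - 1)

# Hypothetical trace: switches half the time, but only between two phases.
few_phases = [1, 1, 2, 2, 1, 1, 2, 2, 1]
# Hypothetical trace: switches rarely, but visits many distinct phases.
many_phases = [1] * 10 + [2] * 10 + [3] * 10 + [4] * 10

rate_few = phase_change_rate(few_phases)
rate_many = phase_change_rate(many_phases)
distinct_few = len(set(few_phases))
distinct_many = len(set(many_phases))
```

The change rate alone would rank the first trace as more complex, even though it visits fewer distinct phases and follows a regular repeating pattern.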

5.1 Markov Predictor 

The prediction of phase behavior is different from many other 
systems in which hardware predictors are used. Because of this 
new environment, a new type of predictor has the potential to 
perform better than simply using predictors from other areas of 
computer architecture (branch and address prediction, memory 
disambiguation, etc.). 

After observing the way that phases change, we determined 
that two pieces of information are important. First, the set of 
phases leading up to the prediction are very important, and sec- 
ond, the duration of execution of those phases is important. 

A classic prediction model that is easily implementable in 
hardware is a Markov Model. Markov Models have been used
in computer architecture to predict both prefetch addresses [13] 
and branches [8] in the past. The basic idea behind a Markov 
Model is that the next state of the system is related to the last set 
of states. 

The intuition behind this design is that phase information 
tends to be characterized by many sections of stable behavior 
interspersed with abrupt phase changes. The key is to be able to 
predict when these phase changes will occur, and to know ahead 
of time what phase they will change to. The problem is that the 
changes are often preceded by stable conditions, and if we only 
consider the last couple of intervals we will not be able to tell 
the difference between sections of stable behavior that precede 
a phase change, and those sections that will continue to be stable.
Instead, we need a way of compressing down stable phase
information into a piece of information that we can use as state.

Figure 7: Examination of per-phase homogeneity compared to the program as a whole (denoted by full). For the two programs
and each of the top 5 phases of each program, we show the average value of each metric and the standard deviation. The name
of the phase is the percent of execution that it accounts for in terms of instructions. These results show that after dividing up the
program into phases using our run-time scheme the behavior within each phase is quite consistent.

Figure 8: The percent of execution intervals that transition to
a different phase from the prior execution interval's phase as
found by our phase tracking architecture with 32 footprint counters
using a 1 million Manhattan threshold.

5.2 Run Length Encoding Markov Predictor 

To compress the stable state we use a Run Length Encoding 
(RLE) Markov predictor. The basic idea behind the predictor 
is that it uses a run-length encoded version of the history to in- 
dex into a prediction table. The index into the prediction table is 
a hash of the phase identifier and the number of times the phase 
identifier has occurred in a row. 

Figure 9 shows our RLE Markov Phase ID prediction architecture.
The lower order bits of the hash function provide an
index into the prediction table, and the higher order bits of the 
hash function provide a tag. When there is a tag match, the phase 
ID stored in the table provides a prediction as to the next phase 
to occur in execution. When there is a tag miss, the prior phase 
ID is assumed to be the next phase ID to occur in the program's
execution. We found predicting the last phase ID to be 75%
accurate on average. 

We only update the predictor when there is (1) a change in
the phase ID, or (2) a tag match. We only insert an
entry when there is a phase ID change, since we want to predict
when the phase is going to change. Execution intervals where
the same phase ID occurs several times in a row do not need
to be stored in the table, since they will be correctly predicted
as "last phase ID" when there is a table miss. This helps
with table capacity constraints and avoids polluting the table with last
phase predictions. For the second update case, when there is
a tag match, we update the predictor because the observed run
length may have changed.

Figure 9: Phase Prediction Architecture for the Run Length Encoded
(RLE) Markov predictor. The basic idea behind the predictor
is that two pieces of information are used to generate the
prediction: the phase ID that was just seen, and the number of
times prior to now that it has been seen in a row. The index into
the prediction table is a hash of these two numbers.
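
The prediction and update rules of this section can be sketched in software as follows. This is a minimal sketch under stated assumptions, not the hardware design itself: the class name RLEMarkovPredictor, the hash (run length shifted left by six bits, combined with a 6-bit phase ID), the index/tag split, and the choice to overwrite an entry with the observed outcome on a tag match are all illustrative choices that the text does not specify.

```python
class RLEMarkovPredictor:
    """Sketch of a run-length-encoded Markov phase predictor.

    Assumptions not taken from the text: the hash function, the
    index/tag split, and overwriting on a tag match.
    """

    def __init__(self, entries=256):
        self.entries = entries
        self.table = [None] * entries  # each slot: (tag, next_phase_id)
        self.phase = None              # phase ID of the latest interval
        self.run = 0                   # times it has occurred in a row

    def _index_and_tag(self):
        # Illustrative hash: assumes phase IDs fit in 6 bits.
        h = (self.run << 6) | (self.phase & 0x3F)
        return h % self.entries, h // self.entries

    def predict(self):
        """Predict the phase ID of the next interval."""
        if self.phase is None:
            return None
        index, tag = self._index_and_tag()
        slot = self.table[index]
        if slot is not None and slot[0] == tag:
            return slot[1]             # tag match: use the stored phase ID
        return self.phase              # tag miss: predict "last phase ID"

    def update(self, actual):
        """Record the phase ID actually observed for the next interval."""
        if self.phase is not None:
            index, tag = self._index_and_tag()
            slot = self.table[index]
            hit = slot is not None and slot[0] == tag
            # Insert only on a phase change; refresh only on a tag match.
            if actual != self.phase or hit:
                self.table[index] = (tag, actual)
        if actual == self.phase:
            self.run += 1
        else:
            self.phase, self.run = actual, 1


def accuracy(trace):
    """Fraction of intervals whose phase ID was predicted correctly."""
    predictor = RLEMarkovPredictor()
    correct = 0
    for phase_id in trace:
        correct += predictor.predict() == phase_id
        predictor.update(phase_id)
    return correct / len(trace)


# Hypothetical trace: three intervals of phase 1, then two of phase 2.
rle_accuracy = accuracy([1, 1, 1, 2, 2] * 4)
```

On this made-up repeating trace the sketch mispredicts only the very first interval and the first occurrence of each phase change; once the (phase ID, run length) pairs that precede a change are in the table, later changes are predicted correctly.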

5.3 Predictor Comparison 

We compare our RLE Markov phase predictor with other pre- 
diction schemes in Figure 10. This Figure has four bars for ev- 
ery program, and each bar corresponds to the prediction accu- 
racy of a prediction architecture. The first and simplest scheme, 
Last Phase, simply predicts that the next phase is the same as 
the current phase, in essence always predicting stable operation. 
The prediction accuracy of this scheme is inversely proportional 
to the rate at which phases change in a given benchmark. For 
the program gzip for example, there are long periods of execu- 
tion where the phase does not change, and therefore predicting 
no-change does exceptionally well. 

In order to ensure that we were not simply providing an
expensive filter for noise in the phase IDs, we also compared
against a simple noise filter which works by predicting that the
next phase will be the most commonly occurring of the last three
phases seen. This is not shown, as it actually performed worse
on all of the programs.

Figure 10: Phase ID Prediction Accuracy. This figure shows
how well different prediction schemes work. The most naive
scheme, last, simply predicts that the phases never change.
The bars marked Markov and RLE Markov show how well
we can predict the phase identifiers if we use a Markov prediction
scheme with a Markov table size of 256 entries.

Additionally we wanted to examine the effect of a simple 
Markov model predictor for history lengths of 1 and 2. The 
Markov model predictor does a better job of predicting phase 
transitions than Last Phase, but it is limited by the fact that
long runs will always be predicted as infinitely stable due to the
history filling up. However, it is still very effective for facerec
and applu, but does not provide much benefit for either art or 
galgel. 
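
For comparison, the three baseline schemes described in this section (Last Phase, the majority-of-three noise filter, and a simple Markov predictor with a fixed history length) can be sketched as below. The names, the tie-breaking rule in the noise filter, and the use of an unbounded dictionary in place of a finite hardware table are assumptions made for illustration.

```python
from collections import Counter

def last_phase(history):
    """Last Phase: predict that the next phase equals the current one."""
    return history[-1]

def majority_of_three(history):
    """Noise filter: most common of the last three phase IDs.

    Ties are broken in favor of the most recent phase ID (an assumption).
    """
    recent = history[-3:]
    counts = Counter(recent)
    best = max(counts.values())
    for pid in reversed(recent):
        if counts[pid] == best:
            return pid

class MarkovPredictor:
    """Order-k Markov predictor keyed on the last k phase IDs."""

    def __init__(self, order=2):
        self.order = order
        self.table = {}  # unbounded here; a hardware table would be finite

    def predict(self, history):
        key = tuple(history[-self.order:])
        # Fall back to Last Phase when this history has not been seen.
        return self.table.get(key, history[-1])

    def update(self, history, actual):
        self.table[tuple(history[-self.order:])] = actual

# Hypothetical alternating trace: Last Phase is always wrong here, while
# an order-1 Markov predictor is right once both transitions are learned.
markov = MarkovPredictor(order=1)
trace = [1, 2, 1, 2, 1, 2]
markov_hits = 0
for i in range(1, len(trace)):
    markov_hits += markov.predict(trace[:i]) == trace[i]
    markov.update(trace[:i], trace[i])
```

Note that once a run is longer than the history length, the key consists of k copies of the same phase ID, so this predictor cannot distinguish a run that is about to end from one that will continue. That is the limitation the text attributes to the history filling up.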

The final bar, RLE Markov, is our improved Markov pre- 
dictor which compresses stable phases into a tuple of phase 
id and duration. All of the Markov predictors simulated had 
256 entries taking up less than 500 bytes of storage. Using 
RLE Markov outperforms both the Last Phase and tradi- 
tional Markov on all the benchmarks. It performs especially 
well compared to other schemes on both applu and art. Over- 
all, using a Run-Length Encoded Markov predictor can cut the 
phase mispredictions down to 14% on average. 

6 Applications 

This section examines three optimization areas in which a phase- 
aware architecture can provide an advantage. We begin by ex- 
amining the relationship between phase behavior and value lo- 
cality. We then demonstrate ways to reduce processor energy 
consumption by adjusting the aggressiveness of the data cache 
and the instruction front end. 

6.1 Frequent Value Locality 

Prior work on value predictors has shown that there is a great 
deal of value locality in a variety of programs [14, 7]. Recently, 
researchers have started to take advantage of frequently loaded
values for the purpose of optimizing caches. For example, Yang
and Gupta [22] proposed a data cache organization that com- 
presses the most frequently used program values in order to save 
energy. Another way of exploiting value locality is through value 
specialization, which can be done either statically or dynami- 
cally [6, 17, 16] to create specialized versions of procedures or 
code-regions based upon the values frequently seen. These techniques
are built on the idea of finding the most frequent values
for loads over the whole program, and then specializing the pro- 
gram to those frequent values. 

We examine the potential of capturing frequent values on a 
per-phase basis and compare this to the frequent values aggre- 
gated over the entire program, as would be used in value code 
specialization [6]. To perform this experiment we first gathered
the top 16 values that were loaded over the complete execution
of the program and stored them into a table. We then examined
the percentage of executed loads that found their loaded value in
this table. This result is shown as Static in Figure 11. While
significant portions of some programs are covered by just these 
few top values (such as applu), over half of the programs have 
less than 10% of their loaded values covered by these top values. 

The question is: can we do better by exploiting hardware- 
detected phase information? To answer this question we take the 
top 16 values for each phase, as detected by the hardware phase 
tracker. These top values will be shared across a single phase 
even if it is split into two or more different sections of execution. 
Each load in the program is then checked against the top val- 
ues for its corresponding phase. The Phase Coverage bar 
in Figure 11 shows the percent of all load values in the program 
that were successfully matched to its per-phase top value set.

Without any notion of loads or values, our method of divid- 
ing up phases is very successful at assisting in the search for fre- 
quent values. By just tracking the top 16 values of each phase, 
we are able to capture the values from almost 50% of the executed
loads on average. The Perfect bar shows the percentage of
loads covered if one captures the top 16 load values for each and 
every interval (i.e., 10 million instructions) separately. This is in 
effect the best that we could hope to achieve for an interval size 
of 10 million instructions, because the 16 entries in the value ta- 
ble are custom crafted for each interval individually. As shown 
in Figure 11, the phase-tracker compares favorably with the op- 
timal coverage. Two thirds of the total possible benefit from 
per-interval value locality can be captured by per-phase value 
locality. It is important to point out this graph by itself is not 
a good indicator of usefulness as near perfect coverage could 
be achieved simply by making every interval a separate phase. 
However, as shown in Figure 5 only a few phases (around 20) 
are used to cover at least 80% of the program's execution. 
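
The three coverage figures compared in this section can be computed as in the following sketch. The function names and the tiny data set are hypothetical, and the example uses a one-entry table (k=1) instead of the 16 entries used in the text purely to keep the numbers small.

```python
from collections import Counter

def top_values(loads, k=16):
    """Set of the k most frequently loaded values."""
    return {value for value, _ in Counter(loads).most_common(k)}

def hits(loads, table):
    """Number of loads whose value is present in the table."""
    return sum(1 for value in loads if value in table)

def coverage(intervals, phase_ids, k=16):
    """Return (static, per_phase, per_interval) load coverage fractions.

    intervals: one list of loaded values per execution interval.
    phase_ids: the phase ID assigned to each interval.
    """
    all_loads = [value for loads in intervals for value in loads]
    total = len(all_loads)

    # Static: one table built over the whole program.
    static = hits(all_loads, top_values(all_loads, k)) / total

    # Per-phase: one table shared by all intervals with the same phase ID.
    by_phase = {}
    for pid, loads in zip(phase_ids, intervals):
        by_phase.setdefault(pid, []).extend(loads)
    tables = {pid: top_values(loads, k) for pid, loads in by_phase.items()}
    per_phase = sum(hits(loads, tables[pid])
                    for pid, loads in zip(phase_ids, intervals)) / total

    # Per-interval: a table crafted for each interval individually.
    per_interval = sum(hits(loads, top_values(loads, k))
                       for loads in intervals) / total
    return static, per_phase, per_interval

# Hypothetical data: three intervals, the first and last in the same phase.
static, per_phase, per_interval = coverage(
    [[1, 1, 2], [3, 3, 4], [1, 5, 5]], phase_ids=[0, 1, 0], k=1)
```

On this toy input the per-phase table covers more loads than the static table and fewer than the per-interval tables, mirroring the ordering of the three bars described for Figure 11.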

6.2 Dynamic Data Cache Size Adaptation 

In a modern processor a significant amount of energy is con- 
sumed by the data cache, but this energy may not be put to 
good use if an application is not accessing large amounts of data 
with high locality. To address this potential inefficiency, previ- 
ous work has examined the potential of dynamically reconfigur- 
ing the data caches with the intention of saving power. In [2], 
Balasubramonian et al. present two different schemes with
which re-configuration may be guided. In one scheme, hardware
performance counters are read by re-configuration software
every hundred thousand cycles. The software then makes a decision
based on the values of the counters. In another scheme,
re-configuration decisions are performed on procedure boundaries
instead of at fixed intervals. To reduce the overhead of
re-configuration, software to trigger re-configuration is only placed
before procedures that account for more than a certain percentage
of execution.

Figure 11: The percent of the program's load values that are
found in a table of the most frequently loaded values over the
whole program (Static Coverage), on a per-phase basis (Phase
Coverage), and on a per execution interval basis (Optimal Coverage).

Another form of re-configurable cache that has been pro- 
posed dynamically divides the data cache into multiple parti- 
tions, each of which can be used for a different function such 
as instruction reuse buffers, value predictors, etc. [18]. These
techniques can be triggered at different points in program exe- 
cution including procedure boundaries and fixed intervals. The 
overhead of re-configuration can be quite large and making these 
policy decisions only when the large scale program behavior 
changes, as indicated by phase shifts in our hardware tracker, 
can minimize overhead while guaranteeing adequate sensitivity 
to attain maximum benefit. 

We examined the use of phase tracking hardware to guide an 
energy aware, re-sizable cache. The energy consumption of the 
data cache can be reduced by dynamically shifting to a smaller, 
less associative cache configuration for program phases that do 
not benefit significantly from more aggressive cache configura- 
tions. By targeting only those phases that are predicted to have 
energy savings due to cache size reduction, our scheme is able 
to reduce power with very little impact on the performance. 

We examined an architecture with two possible cache con- 
figurations, 32KB 4-way associative and 8KB direct mapped. In 
Figure 12, the trade off between these two configurations is plot- 
ted. For each program, we use the 32KB cache configuration as 
the baseline result. The labeled circles in Figure 12 show the 
total processor energy savings and performance degradation for 
each program if only the smaller (8KB) cache size is used. For 
example, a processor with a smaller cache configuration for the 
program applu is both 5% slower and uses 5% less energy. 
Two programs, vpr and apsi, actually use more energy with a
smaller cache due to large slow downs. These two points are off
the scale of this graph and are not shown.

Figure 12: Data Cache Re-configuration. The tradeoff between
energy savings and slowdown for two different cache policies.
All results are relative to a 32KB 4-way associative cache. The
circles in the graph (each labeled with a number for the program
the data point is from) show the energy and performance of an
8KB direct mapped cache. The triangles show the tradeoff of intelligently
switching between an 8KB direct mapped and a 32KB
4-way data cache based on phase classification and prediction.

While examining energy savings and slow down is interest- 
ing, it is important to note that there is more than one way to 
reduce both energy and performance. Voltage scaling in particu- 
lar has proven to be a technology capable of reaping large energy 
savings for a relative reduction in performance. For our results, 
we assume that for voltage scaling a performance degradation of 
5% will yield an approximate energy saving of 15%. We use this 
rule of thumb as our guideline for determining when to reduce 
the active size of the cache. In Figure 12, this simple model of 
voltage scaling is plotted as a dashed line. When the cache size 
is reduced, most programs fall far short of this baseline, meaning 
that voltage scaling would provide a better performance-energy
tradeoff. There are a few exceptions: in particular, mcf,
bzip, and gzip do well even without any sort of phase-based
re-configuration.

The shaded triangles in Figure 12 show what happens if 
we use phase classification and prediction to guide our re- 
configuration. When a new phase ID is seen, we sample the IPC 
and energy used for a few intervals using the 32KB 4-way cache, 
and a few intervals for the 8KB direct mapped cache. These sam- 
ples could be kept in a small hardware profiling table associated 
with the phase ID. After taking these samples, if we find that a 
particular phase is able to achieve more than three times the en- 
ergy savings relative to the slow down seen when using the 8KB 
cache, we then predict for this phase ID that the smaller cache
size should be used. This heuristic means that the small cache 
size is used only if re-configuration would beat voltage scaling 
for that phase. After a decision has been made as to the configuration
to use for a phase ID, the corresponding cache size is
stored in the phase profiling table/database associated with that
phase ID. The phase classifier and predictor are then used to predict
when a phase change occurs. When a phase change prediction
occurs, the predicted phase ID is used to look up the cache size in the
profiling table, and the cache is re-configured (if it is not already
that size) at the predicted phase change.

[Figure 13 plot. Horizontal axis: Slowdown (5%, 10%, 15%, 20%). Series: Low Issue (circles), Phase Aware (triangles). Program labels: 0 applu, 1 apsi, 2 art, 3 bzip2, 4 facerec, 5 galgel, 6 gcc, 7 gzip, 8 mcf, 9 vpr.]

Figure 13: Processor Width Adaptation. The tradeoff between
energy savings and slowdown for two different front end policies.
All results are relative to an aggressive 8-issue machine.
The circles in the graph (each labeled with a number for the
program) show the energy and performance of a less aggressive
2-issue processor. The triangles show using the phase classifier
and predictor for switching between 2-issue and 8-issue based
on phase changes.
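
The per-phase cache decision described in this section can be sketched as follows. The sketch assumes that each sample is an (IPC, energy) pair measured over intervals of equal instruction count, and it defines slowdown as the relative increase in execution time; the function names, the configuration strings, and the profiling dictionary are hypothetical. The factor of three comes from the rule of thumb in the text that a 5% slowdown under voltage scaling yields roughly 15% energy savings.

```python
# Rule of thumb from the text: 5% slowdown buys about 15% energy savings.
VOLTAGE_SCALING_RATIO = 15.0 / 5.0

LARGE = "32KB 4-way"
SMALL = "8KB direct mapped"

def choose_cache(ipc_large, energy_large, ipc_small, energy_small):
    """Pick a cache configuration for one phase from sampled intervals."""
    # Relative increase in execution time for the same instruction count.
    slowdown = ipc_large / ipc_small - 1.0
    # Relative energy saved by the small configuration.
    savings = 1.0 - energy_small / energy_large
    if savings <= 0.0:
        return LARGE
    # Use the small cache only if it beats the voltage-scaling tradeoff.
    if slowdown <= 0.0 or savings > VOLTAGE_SCALING_RATIO * slowdown:
        return SMALL
    return LARGE

profile = {}  # hypothetical profiling table: phase ID -> configuration

def record_phase(phase_id, ipc_large, energy_large, ipc_small, energy_small):
    """Store the decision for a newly seen phase ID."""
    profile[phase_id] = choose_cache(ipc_large, energy_large,
                                     ipc_small, energy_small)

def config_for(predicted_phase_id):
    """Configuration to switch to at a predicted phase change."""
    # Unknown phases default to the large cache (an assumption).
    return profile.get(predicted_phase_id, LARGE)

# Hypothetical samples: phase 1 loses 5% speed for 5% energy (not worth it),
# phase 2 loses 1% speed for 5% energy (beats the 3:1 rule of thumb).
record_phase(1, 1.05, 100.0, 1.00, 95.0)
record_phase(2, 1.01, 100.0, 1.00, 95.0)
```

A phase is moved to the small cache only when its sampled energy savings exceed three times its sampled slowdown, which is the condition under which re-configuration would beat the assumed voltage-scaling line.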

For all programs, our re-configuration is able to beat 
or tie voltage scaling. For example, using phase-based re-configuration
results in a slowdown of 0.5% for applu, while
the total energy savings is 4.5%. Even the program apsi, which 
had increased energy consumption in the small cache configura- 
tion, is able to get almost 5% energy savings with only a 1% 
slowdown. 

6.3 Dynamic Processor Width Adaptation 

One way to reduce the energy consumption in a processor is to 
reduce the number of instructions entering the pipeline every cy- 
cle [12, 1]. We call this adjusting the width of the processor. 
Reducing the width of the processor reduces the demand on the 
fetch, decode, functional units, and issue logic. Certain phases 
can have a high degree of instruction level parallelism, whereas 
other phases have a very low degree. Take for example the top 
two phases for gcc shown in Figure 7. The intervals classified 
to be in the first phase consisting of 18.5% of execution have an 
IPC of 0.61 with a high data cache miss rate. In comparison, 
the intervals in the second most frequently encountered phase 
(accounting for 18.1% of execution) have an IPC of 1.95 and 
very low data cache miss rates. We can potentially save energy 
without hurting performance by throttling back the width of the 
processor for phases that have low IPC, while still using aggres- 
sive widths for phases with high IPC. 



In the current literature, decisions to reduce or increase the 
fetch/decode/issue bandwidth of the processor are made either 
at fixed intervals (relatively short intervals such as 1,000 cy- 
cles) [12] or, as in the case of branch confidence based schemes, 
when a branch instruction is fetched [1]. It can be very difficult to
design real systems that save energy by reconfiguring at these 
speeds, but a hardware phase-tracker can help make these deci- 
sions at a coarser granularity while still maintaining performance 
and energy benefits. 

We examined an architecture that could be configured with 2 
different widths - one where up to 2 instructions are decoded and 
up to 2 issued per cycle, and one where up to 8 instructions are 
decoded and up to 8 issued per cycle. When a new phase ID is 
seen by the phase tracker, we sample the IPC for three intervals 
with a width of 2 instructions, and three intervals with a width 
of 8 instructions. If there is little difference in the IPC between 
these two widths, then we assign a width of 2 instructions to this 
Phase ID in our profiling table, otherwise we assign a width of 
8 instructions. During execution, we use the phase ID predictor 
to effectively predict the width for the next interval of execution 
and adjust the processor's width accordingly. Our results show 
that the chosen configuration for a given phase can be trained 
(1) with only a few samples, and (2) only once to accurately 
represent the behavior of a given phase ID. This requires very 
little training time due to the fact that 20 or fewer phase IDs 
are needed to capture 80% or more of a program's execution as 
shown in Figure 5. 
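
The width-selection procedure described above can be sketched as follows. The text does not say how small the IPC difference must be to count as "little difference", so the 5% tolerance used here is purely an assumed parameter, as are the function names and the dictionary standing in for the profiling table.

```python
from statistics import mean

def choose_width(ipc_at_2, ipc_at_8, tolerance=0.05):
    """Choose a width from IPC samples taken at each configuration.

    tolerance is a hypothetical threshold: the narrow width is chosen
    when it loses no more than this fraction of the wide-machine IPC.
    """
    narrow, wide = mean(ipc_at_2), mean(ipc_at_8)
    return 2 if (wide - narrow) <= tolerance * wide else 8

width_table = {}  # hypothetical profiling table: phase ID -> width

def train(phase_id, ipc_at_2, ipc_at_8):
    """Train once per phase ID, the first time it is seen."""
    if phase_id not in width_table:
        width_table[phase_id] = choose_width(ipc_at_2, ipc_at_8)

def width_for(predicted_phase_id):
    """Width to use for the next interval, given the predicted phase ID."""
    # Untrained phases default to the aggressive width (an assumption).
    return width_table.get(predicted_phase_id, 8)

# Hypothetical samples, three intervals at each width.
train(1, [0.60, 0.61, 0.62], [0.61, 0.61, 0.62])  # low IPC either way
train(2, [1.20, 1.20, 1.20], [1.95, 1.95, 1.95])  # benefits from 8-wide
```

Because each phase ID is trained only once, the sampling cost is paid a handful of times per program, which is what makes the coarse-grained scheme cheap.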

Figure 13 is a graph of the results seen when applying phase- 
directed width re-configuration. The white circles in the graph 
show the behavior of running the programs on only a 2-wide 
machine relative to the more aggressive 8-wide machine. The 
dotted line again shows what could potentially be achieved if 
voltage scaling was used. While mcf and art save a lot of en- 
ergy with little performance degradation on a 2-wide machine, 
the other programs do not fare as well. The program apsi, for
example, has a slowdown of over 22% with an energy savings of 
around 30%. This does not compare favorably to voltage scal- 
ing (as discussed in Section 6.2). On the other hand if we use 
phase-directed width throttling on apsi, a total processor en- 
ergy savings of 18% can be achieved with only 2.2% slowdown. 

For all of the programs we examined, with one exception, 
the slowdown due to phase aware width throttling was less than 
4%, while the average energy savings was 19.6%. This result 
demonstrates that there is significant benefit to be had in the re- 
configuration of processor front end resources even at very large 
granularities. In the worst case, this will mean a re-configuration 
every 10 million instructions, and on average every 70 million 
instructions. This should be designable even under conservative 



7 Summary 

In this paper we present an efficient run-time phase tracking ar- 
chitecture that is based on detecting changes in the code being 
executed. This is accomplished by dividing up all instructions 
seen into a set of buckets based on branch PCs. This way we ap- 
proximate the effect of taking a random projection of the basic 
block vector, which was shown in [21] to be an effective method
of identifying phases in programs. 

Using our phase classification architecture with less than 500 
bytes of on-chip memory, we show that for most programs, a significant
amount of the program (over 80%) is covered by 20 or
fewer distinct phases. Furthermore, we show that these phases,
while being distinct from one another, have fairly uniform be- 
havior within a phase, meaning that most optimizations applied 
to one phase will work well on all intervals in that phase. In the 
program gcc, the IPC attained by the processor on average over 
the full run of execution is 1.32, but has a standard deviation 
of more than 43%. By dividing it up into different phases, we 
achieve much more stable behavior, with IPCs ranging between 
0.61 and 1.95, but now with standard deviations of less than 2%. 

In addition to this, we present a novel phase prediction architecture
using a Run Length Encoding Markov predictor that can
predict not only when a phase change is about to occur, but also
the phase ID to which it will transition. In using this design, which
also uses less than 500 bytes of storage, we achieve a phase 
prediction miss rate of 10% for applu and 4% for apsi. In 
comparison, always predicting that the phase will stay the same 
results in a miss rate of 40% and 12% respectively. 

We also examined using our phase tracking and prediction 
architecture to enable new phase-directed optimizations. Tra- 
ditional architecture and software optimizations are targeted at 
the average or aggregate behavior of a program. In comparison, 
phase-directed optimizations aim at optimizing a program's per- 
formance tailored to the different phases in a program. In this pa- 
per, we examined using phase tracking and prediction to increase 
frequent value profiling coverage, and to provide energy savings 
through data cache and processor width re-configuration. 

We believe our phase tracking and prediction design will 
open the door for a new class of run-time optimization that tar- 
gets large scale program behavior. Even though we present a 
hardware implementation for phase tracking, a similar design 
can be implemented in software to perform phase classification 
for run-time optimizers, just-in-time compilation systems, and 
operating systems. Hardware and software optimizations that 
can potentially benefit the most from phase classification and 
prediction are (1) those that need expensive profiling/training 
before applying an optimization, (2) those where the time or 
cost it takes to perform the optimization is either slow or ex- 
pensive, and (3) those that can benefit from specialization where 
they have the same code/data being used differently in different 
phases of execution. By using our dynamic phase tracking and 
prediction design, phase-behavior can be characterized and pre- 
dicted at the largest of scales, providing a unified mechanism for 
phase-directed optimization. 
Abstract

In a single second a modern processor can execute billions 
of instructions. Obtaining a bird's eye view of the behavior of a 
program at these speeds can be a difficult task when all that is 
available is cycle by cycle examination. In many programs, be- 
havior is anything but steady state, and understanding the pat- 
terns of behavior, at run-time, can unlock a multitude of opti- 
mization opportunities. 

In this paper, we present a unified profiling architecture that 
can efficiently capture, classify, and predict phase-based pro- 
gram behavior on the largest of time scales. By examining the 
proportion of instructions that were executed from different sec- 
tions of code, we can find generic phases that correspond to 
changes in behavior across many metrics. By classifying phases 
generically, we avoid the need to identify phases for each opti- 
mization, and enable a unified prediction scheme that can fore- 
cast future behavior. Our analysis shows that our design can 
capture phases that account for over 80% of execution using less
than 500 bytes of on-chip memory.

1 Introduction 

Modern processors can execute upwards of 5 billion instructions 
in a single second, yet most architectural features target program 
behavior on a time scale of hundreds to thousands of instructions,
less than half a μs. While these optimizations can provide
large benefits, they are limited in their ability to see the program 
behavior in a larger context. 

Recently there has been a renewed interest in examin- 
ing the run-time behavior of programs over longer periods of 
time [10, 11, 19, 20, 3]. It has been shown that programs can 
have considerably different behavior depending on which por- 
tion of execution is examined. More specifically, it has been 
shown that many programs execute as a series of phases, where 
each phase may be very different from the others, while still hav- 
ing a fairly homogeneous behavior within a phase. Taking ad- 
vantage of this time varying behavior can lead to, among other 
things, improved power management, cache control, and more 
efficient simulation. The primary goal of this research is the de- 
velopment of a unified run-time phase detection and prediction 
mechanism that can be used to guide any optimization seeking 
to exploit large scale program behavior. 

A phase of program behavior can be defined in several ways. 
Past definitions are built around the idea of a phase being an in- 
terval of execution during which a measured program metric is 
relatively stable. We extend this notion of a phase to include all 
similar sections of execution regardless of temporal adjacency. 
Simply put, if a phase of execution is correctly identified, there
should only be small variations between any two execution intervals
identified as being part of the same phase. A key point of
this paper is that the phase behavior seen in any program metric 
is directly a function of the way the code is being executed. If 
we can accurately capture this behavior at run-time through the 
computation of a single metric, we can use this to guide many 
optimization and policy decisions without duplicating phase de- 
tection mechanisms for each optimization. 

In this paper, we present an efficient run-time phase tracking 
architecture that is based on detecting changes in the propor- 
tions of the code being executed. In addition, we present a novel 
phase prediction architecture that can predict, not only when a 
phase change is about to occur, but also the phase to which it 
will transition. Since our phase tracking implementation is
based upon code execution frequencies, it is independent of any 
individual architecture metric. This allows our phase tracker to 
be used as a general profiling technique building up a profile or 
database of architecture information on a per phase basis to be 
used later for hardware or software optimization. Independence 
from architecture metrics allows us to consistently track phase 
information as the program's behavior changes due to phase- 
based optimizations. 

We demonstrate the effectiveness of our hardware based 
phase detection and classification architecture at automatically 
partitioning the behavior of the program into homogeneous 
phases of execution and at identifying phase changes. We show
that the changes in many important metrics, such as IPC and en- 
ergy, correlate very closely with the phase changes found by our 
metric. We then evaluate the effectiveness of phase tracking and 
prediction for value profiling, data cache reconfiguration, and 
re-configuring the width of the processor. 

The rest of the paper is laid out as follows. In Section 2, 
prior work related to phase-based program behavior is discussed. 
Simulation methodology and benchmark descriptions can be 
found in Section 3. Section 4 describes our phase tracking ar- 
chitecture. The design and evaluation of the phase predictor are 
found in Section 5. Section 6 presents several potential applica- 
tions of our phase tracking architecture. Finally, the results are 
summarized in Section 7. 

2 Related Work 

In this section we describe work related to phase identification
and phase-based optimization. 

In [19], we provided an initial study into the time varying 
behavior of programs, showing that programs have repeatable 
phase-based behavior over many hardware metrics (cache behavior,
branch prediction, value prediction, address prediction, IPC and
RUU occupancy) for all the SPEC 95 programs. Looking
at these metrics over time, we found that many programs have 
repeating patterns, and that important metrics tend to change at 
the same time. These places represent phase boundaries. 

In [20], we proposed that by profiling only the code that was 
executed over time we could automatically identify periodic and 
phase behavior in programs. The goal was to automatically find 
the repeating patterns observed in [19], and the lengths (peri- 
ods) of these patterns. We then extended this work in [21], using 
techniques from machine learning to break the complete exe- 
cution of the program into phases (clusters) by only tracking the 
code executed. We found that intervals of execution grouped into 
the same phase had similar behavior across all the architecture 
metrics examined. From this analysis, we created a tool called 
SimPoint [21], which automatically identifies a small set of in- 
tervals of execution (simulation points) in a program to perform 
architecture simulations. These simulation points provide an ac- 
curate and efficient representation of the complete execution of 
the program. 

The work of Dhodapkar and Smith [10, 9] is the most closely 
related to ours. They found a relationship between phases and 
instruction working sets, and that phase changes occur when the 
working set changes. They propose that by detecting phases and 
phase changes, multi-configuration units can be re-configured in 
response to these phase changes. They have used their working 
set analysis for instruction cache, data cache and branch predic- 
tor re-configuration to save energy [10, 9]. 

The work we present in this paper identifies phases and phase 
changes by keeping track of the proportions in which the code 
was executed during an interval based upon the profiler used 
in [20]. In comparison, Dhodapkar and Smith [10, 9] track the 
phase and phase changes based solely upon what code was executed
(working set), without weighting the code by its frequency of 
execution. Future research is needed to compare these two ap- 
proaches. 

Additional differences between our work and theirs include our
examination of architectures for predicting phase changes, and
different uses from [10, 9], such as value profiling and processor width
reconfiguration. We provide an architecture that can fairly accu- 
rately predict what the next phase will be, along with predicting 
when there will be a phase change. In comparison, Dhodapkar 
and Smith do not examine phase-based prediction [10, 9], but 
concentrate on detecting when the working set size changes, and 
then reactively apply optimization. 

Merten et al. [15] developed a run-time system for dynami- 
cally optimizing frequently executed code. Then in [3], Barnes 
et al. extend this idea to perform phase-directed compiler
optimizations. The main idea is the creation of optimized code
"packages" that are targeted towards a given phase, with the goal 
of execution staying within the package for that phase. Barnes et 
al. concentrate primarily on the compiler techniques needed to 
make phase-directed compiler optimizations a reality, and do not 
examine the mechanics of hardware phase detection and classi- 
fication. We believe that using the techniques in [3] in conjunc- 
tion with our phase classification and prediction architecture will 
provide a powerful run-time execution environment.

3 Methodology

To perform our study, we collected information for ten 
SPEC 2000 programs: applu, apsi, art, bzip, facerec,
galgel, gcc, gzip, mcf, and vpr, all with reference inputs.
All programs were executed from start to completion using Sim- 
pleScalar [5] and Wattch [4]. Because of the lengthy simulation 
time incurred by executing all of the programs to completion, 
we chose to focus on only 10 programs. We chose the above 
10 programs since their phase based behavior represents a rea- 
sonable snapshot of the SPEC 2000 benchmark suite, along with 
picking some of the programs that showed the most interesting 
phase-based behavior. Each program was compiled on a DEC 
Alpha AXP-21164 processor using the DEC C and FORTRAN
compilers. The programs were built under the OSF/1 V4.0 operating
system using full compiler optimization (-O4 -ifo).

The timing simulator used was derived from the Sim- 
pleScalar 3.0 tool set [5], a suite of functional and timing simu- 
lation tools for the Alpha AXP ISA. The baseline microarchitec- 
ture model is detailed in Table 1. In addition to this, we wanted 
to examine energy usage optimizations, so we used a version of 
Wattch [4] to capture this information. We modified all of these 
tools to log and reset the statistics every 10 million instructions, 
and we use this as a base for evaluation. 

4 Phase Capture 

In this section we motivate the occurrence of phase-based behav- 
ior, describe our architecture for capturing it, and examine the 
accuracy of using the program behavior in our phase-tracking 
architecture to identify phase changes for various hardware met- 
rics. 

4.1 Phase-Based Behavior 

The goal of this research is to design an efficient and general pur- 
pose technique for capturing and predicting the run-time phase 
behavior of programs for the purpose of guiding any optimiza- 
tion seeking to exploit large scale program behavior. Figure 1 
helps to motivate our approach to the problem. This figure shows 
the behavior of two programs, gcc and gzip, as measured by
various different statistics over the course of their execution from 
start to finish. Each point on the graph is taken over 10 million
instructions worth of execution. The metrics shown are the

Figure 1: To illustrate the point that phase changes happen across many metrics all at the same time, we have plotted the value
of these metrics over billions of instructions executed for the programs gcc (shown left) and gzip (shown right). Each point on
the graph is an average over 10 million instructions. The number of unified L2 cache misses (ul2), the energy consumed by the
execution of the instructions, the number of instruction cache (il1) misses, the number of data cache misses (dl1), the number of
branch mispredictions (bpred) and the average IPC are plotted.



number of unified L2 cache misses (ul2), the energy consumed 
by the execution of the instructions, the number of instruction 
cache (il1) misses, the number of data cache misses (dl1), the
number of branch mispredictions (bpred) and the average IPC. 
The results show that all of the metrics tend to change in unison, 
although not necessarily in the same direction. In addition to 
this, patterns of recurring behavior can be seen over very large 
time scales. 

As can be seen from these graphs, even at a granularity of 10 
million instructions (which is at the same time scale as operating 
system time slices) there can be wildly different behavior seen 
between intervals. In this paper, we concentrate on a granularity 
of 10 million instructions because it is both outside the scope 
of normal architectural timing and is small enough to allow for 
many complex phase behaviors to be seen. 

4.2 Tracking Phases by Executed Code 

Our phase tracker architecture operates at two different time 
scales. It gathers profile information very quickly in order to 
keep up with processor speeds, while at the same time it com- 
pares any data it gathers with information collected over the long 
term. Additionally, it must be able to do all that while still being 
reasonable in size. 

Our phase profile generation architecture can be seen in Fig- 
ure 2. The key idea is to capture basic block information during 
execution, while not relying on any compiler support. Larger
basic blocks need to be weighted more heavily as they account
for a more significant portion of the execution. To approximate 
gathering basic block information, we capture branch PCs and 
the number of instructions executed between branches. The in- 
put to the architecture is a tuple of information: a branch iden- 
tifier (PC) and the number of instructions since the last branch
PC was executed. This allows us to roughly capture each basic
block executed along with the weight of the basic block in terms 
of the number of instructions executed, as we did in [20, 21] for 
identifying simulation points. 

Classifying phases by examining only the code that is ex- 
ecuted allows our phase tracker to be independent of any in- 
dividual architecture metric. This allows our phase tracker to 
be used as a general profiling technique building up a profile or 
database of architecture information on a per phase basis to be 
used later for hardware or software optimization. Independence 
from architecture metrics is also very important to allow us to 
consistently track phase information as the program's behavior 
changes due to phase-based optimizations. 

At this point it is worth making more explicit the differences 
between our technique and that of Dhodapkar and Smith [10, 9]. 
Dhodapkar and Smith use a bit vector to track the working set of
the code for a particular interval, while our technique is based
on the basic block vectors used in [20]. The bit vectors of Dhodapkar
and Smith track a metric that is related to which code
blocks were touched, whereas our metric tracks the proportion
of time spent executing in each code block. This is a subtle but
important distinction. We have found that in complex programs
(such as gcc and gzip) there are many instruction blocks that
execute only intermittently. When tracking the pure working set,
these infrequently executed blocks can disguise the frequently
executed blocks that dominate the behavior of the application.
On the other hand, by tracking the frequency of code execution
it is possible to distinguish important instructions (basic blocks)
from a sea of infrequently executed ones. Examining these differences
in more detail is a topic of future research.

Another advantage of tracking the proportions in which the
basic blocks are executed is that we can use this information to

Figure 2: Our phase classification architecture. Each branch PC
is captured along with the number of instructions from the last
branch. The bucket entry corresponding to a hash of the branch
PC is incremented by the number of instructions. After each
profiling interval has completed, this information is classified,
and if it is found to be unique enough, stored in the past footprint
table along with its phase ID.

identify not only when different sections of code are executing, 
but also when those sections of code are being exercised differently.
A simple example is in a graphics manipulation program
running a parameterized filter on an input image. If you run a 
simple 3x3 blur filter on an image you get very different behavior 
than if you run a 7x7 filter on the same image despite the fact that 
the same filter code is executing. The 7x7 filter will have many 
more memory references and those memory references conflict 
very differently in the cache than in the 3x3 case. We have seen
this very behavior in examining the interactive graphics program 
xv. Using the proportion of execution for each basic block can 
distinguish these differences, because in the 3x3 filter the head 
of the loop is called more than twice as frequently as in the 7x7 
filter. 

The same general idea applies to other data structures as 
well. Take for example a linked list. As the number of nodes in 
the linked list traversal changes over different loop invocations, 
the number of instructions executed inside the loop versus the 
time spent outside the loop also changes. This behavior can be 
captured when including a measure of the proportion of the code 
executed, and this can distinguish between linked list traversals of
different lengths. 

4.3 Capturing the Code Profile 

To index into the accumulator table in Figure 2, the branch PC 
is reduced to a number from 1 to Nbuckets using a hash func- 
tion. We have found that 32 buckets is sufficient to distinguish 
between different phases even for some of the more complex 
programs such as gcc. A counter is kept for each bucket, and
the counter is incremented by the number of instructions from 
the last branch to the current branch being processed. Each ac- 
cumulator table entry is a large (in this study 24-bit), saturating 
counter, which will not saturate during our profiling interval of 
10 million instructions. Updating the accumulator table is the 
only operation that needs to be performed at a rate equivalent to
the processor's execution of the program (once for every branch
executed). In comparison, the phase classification described below
needs to be performed only once every 10 million instructions
(at the end of each interval), and thus is not nearly as
performance critical.
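
As a rough illustration of the accumulator update described above, the following Python sketch models the table in software. The hash (dropping the two low PC bits and taking the result modulo the bucket count) and the sample branch PCs are assumptions made for illustration only; the text does not specify the hash function. Buckets are indexed from 0 here rather than from 1.

```python
NUM_BUCKETS = 32               # bucket count reported in the text
COUNTER_MAX = (1 << 24) - 1    # 24-bit saturating counter, as in this study


def bucket_index(branch_pc):
    """Hypothetical hash: drop the two low PC bits, then reduce modulo the bucket count."""
    return (branch_pc >> 2) % NUM_BUCKETS


def update(table, branch_pc, insns_since_last_branch):
    """Add the instruction count to the bucket for this branch, saturating at 24 bits."""
    i = bucket_index(branch_pc)
    table[i] = min(table[i] + insns_since_last_branch, COUNTER_MAX)


# One (branch PC, instructions since last branch) tuple per executed branch.
# The PCs below are made-up values.
table = [0] * NUM_BUCKETS
trace = [(0x120001000, 5), (0x120001040, 12), (0x120001000, 5)]
for pc, count in trace:
    update(table, pc, count)
```

In this toy trace the two executions of the first branch land in the same bucket, so that bucket holds 10 and the other branch's bucket holds 12. Only `update` would need to run at processor speed; everything else happens once per interval.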

We note that the hashing function we use is fundamentally 
the same as the random projection method we used to generate 
phases in [21]. In this prior work, we make use of random pro- 
jections of the data to reduce the dimensionality of the samples 
being taken. A random projection takes trace data in the form of 
a matrix of size L x B, where L is the length of the trace and B is
the number of unique basic blocks, and multiplies it by a random
matrix of size B x N, where N is the desired dimensionality of
the data, which is much smaller than B. This creates a new matrix
of size L x N, which has clustering properties very similar
to the original data. The random projection method is a powerful 
technique when used with clustering algorithms, and for captur- 
ing phase behavior as we showed in [21]. The hashing scheme 
we use in this paper is essentially a degenerate form of random 
projection that makes a hardware implementation feasible while 
still having low error. If all the elements of the random projection
matrix consist of either a 0 or a 1, and they are placed such
that no row of the matrix contains more than a single 1, then
the random projection is identical to this simple hashing mechanism.
We have designed our phase classification architecture
around this principle. 
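
The equivalence between hashing and a 0/1 projection can be checked on a toy example. The sketch below is an illustration under assumed values, not the implementation used in the study: it builds a B x N matrix in which each basic block contributes a single 1 (in the column of its bucket) and confirms that multiplying the trace by this matrix gives the same result as accumulating weights by hash. The modulo hash and the two-interval trace are hypothetical.

```python
def project(trace_rows, matrix):
    """Multiply an L x B trace matrix by a B x N projection matrix."""
    n = len(matrix[0])
    return [
        [sum(row[b] * matrix[b][j] for b in range(len(row))) for j in range(n)]
        for row in trace_rows
    ]


def hash_matrix(num_blocks, num_buckets, hash_fn):
    """0/1 matrix with a single 1 per basic block, placed in that block's bucket."""
    m = [[0] * num_buckets for _ in range(num_blocks)]
    for b in range(num_blocks):
        m[b][hash_fn(b)] = 1
    return m


def hash_accumulate(row, num_buckets, hash_fn):
    """Direct hashing: add each block's weight to its bucket."""
    out = [0] * num_buckets
    for b, weight in enumerate(row):
        out[hash_fn(b)] += weight
    return out


def toy_hash(block):
    return block % 3  # hypothetical hash onto N = 3 buckets


# L = 2 intervals, B = 6 basic blocks; entries are instruction counts.
trace = [[4, 0, 7, 1, 0, 2],
         [0, 9, 0, 0, 3, 1]]
projected = project(trace, hash_matrix(6, 3, toy_hash))
hashed = [hash_accumulate(row, 3, toy_hash) for row in trace]
```

Both paths yield `[[5, 0, 9], [0, 12, 1]]` for this input, which is why the hardware can skip the matrix multiply entirely.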

Figure 3 shows the effect of applying the above mentioned 
technique for capturing the phase behavior of the integer bench- 
mark gzip. The x-axis of the figure is in billions of instructions, 
as is the case in Figure 1. Each point on the y-axis represents an 
entry of the phase tracker's accumulator table. Each point on the 
graph corresponds to the value of the corresponding accumulator 
table entry at the end of a profiling interval. Dark values repre- 
sent high execution frequency, while light values correspond to 
low frequency. The same trends that were seen in Figure 1 for 
gzip can be clearly seen in Figure 3. In both of these figures, 
when observing them at the coarsest granularity, we can see that 
there are at least three different phases labeled A, B and C. In 
Figure 3, the phase tracker table entries 2, 5, 7, 13 and
17 distinguish the two identical long running phases labeled A 
from a group of three long running phases labeled C. Phase table 
entries 12 and 20 clearly distinguish phase B from both A and 
C. This figure is pictorial evidence that the phase tracker is able 
to break the program's execution into the corresponding phases 
based solely on the executed code, and that these phases corre- 
spond to the behavior seen across the different program metrics 
in Figure 1. 

4.4 Forming a Footprint 

After the profiling interval has elapsed, and branch block infor- 
mation has been accumulated, the phase must then be classified. 
To do this we keep a history of past phase information. 

If we fix the number of instructions for a profiling interval, 
then we can divide each bucket by this fixed number to get the 
percentage of execution that was accounted for by all instructions
mapped to that bucket. However, we do not need to know
the exact percentages for each bucket. Instead of keeping the



Figure 3: Visualization of the accumulator table used to track
program behavior for gzip. The x-axis is in billions of instructions,
while the y-axis is the entry of the accumulator table. Each
point on the graph corresponds to the value of the accumulator
table at the end of a profiling interval where dark values correspond
to more heavily accessed entries. The same trends that
were seen in Figure 1 can be clearly seen in Figure 3.

full counter values, we can instead compress phase information 
down to a couple of the most significant bits. This compressed 
information will then be kept in the Past Footprint table as shown 
in Figure 2. 

The number of counter value bits that we need to observe is 
related to Nbuckets. As we increase the number of buckets, the 
data is spread over more buckets (table entries), making for fewer
entries per bucket (better resolution) but at the cost of more area
(both in terms of number of buckets and more bits per bucket). 
To be on the safe side, we would like any distribution of data into 
buckets to provide useful information. To achieve this we need 
to ensure that even if data is distributed perfectly evenly over 
all of the buckets, we would still record information about the 
frequency of those buckets. This can be achieved by reducing 
the accumulator counter by: 

(bucket[i] x Nbuckets) / (intervalsize)

If the number of buckets and interval size are powers of two, 
this is a simple shift operation. For the number of buckets we 
have chosen (32), and the interval size we profile over, this re- 
duces the bucket size to 6 bits, and thus requires 24 bytes of stor- 
age for each unique phase in the Past Footprint table. In practice 
we see that the top 6 bits of the counter are more than enough 
to distinguish between two phases. In the worst case, you may 
need one or two more bits to reduce quantization error, but in 
reality we have not seen any programs that cause this to be an 
issue. 
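
The reduction can be worked through numerically. The following sketch is a software illustration under stated assumptions, not the hardware design: `compress` applies the formula above with integer division, and `compress_by_shift` shows the shift form for the case where both sizes are powers of two. The function names and the power-of-two interval used in the shift check are hypothetical.

```python
NUM_BUCKETS = 32
INTERVAL_SIZE = 10_000_000
FOOTPRINT_BITS = 6  # bits kept per bucket, per the text


def compress(table, num_buckets=NUM_BUCKETS, interval_size=INTERVAL_SIZE):
    """Reduce each counter to (bucket[i] x Nbuckets) / intervalsize, rounded down."""
    return [(count * num_buckets) // interval_size for count in table]


def compress_by_shift(table, log2_buckets, log2_interval):
    """Same reduction when both sizes are powers of two: a single right shift."""
    return [count >> (log2_interval - log2_buckets) for count in table]


# Perfectly even distribution: each bucket still records a nonzero value.
even = [INTERVAL_SIZE // NUM_BUCKETS] * NUM_BUCKETS
even_footprint = compress(even)        # every entry is 1

# Fully skewed distribution: the largest possible value still fits in 6 bits.
skewed = [INTERVAL_SIZE] + [0] * (NUM_BUCKETS - 1)
skewed_footprint = compress(skewed)    # first entry is 32

# Storage for one footprint: 32 buckets x 6 bits = 24 bytes.
storage_bytes = NUM_BUCKETS * FOOTPRINT_BITS // 8
```

The even case is the worst case the text is guarding against: with fewer retained bits, every bucket would round to zero and the footprint would carry no information.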

If too few buckets are used, aliasing effects can occur due 
to the hashing function, where two different phases will appear 
to have very similar Footprints. Therefore, we want to use a 
sufficiently large number of buckets to uniquely identify the dif- 
ferences in code execution between phases, while at the same 
time use only a small amount of area. 

To examine the aliasing effect and determine what the appro- 
priate number of buckets should be, Figure 4 shows the sum of 
the differences in the bucket weights found between all sequen- 
tial intervals of execution. The y-axis shows the sum total of 
differences for each program. This is calculated by summing the

Figure 4: The percent difference found between Footprints from
sequential intervals of execution, when varying the number of
counters used to represent the footprints. The results are nor-
malized to the difference between intervals found when having
an infinite number of buckets to represent the footprint; 32 buck-
ets captures most of the benefit.

differences between the buckets captured for interval i and i - 1
for each interval i in the program. The x-axis is the number of 
distinct buckets used. All of the results are compared to the ideal 
case of using an infinite number of buckets (or one for each sep- 
arate basic block) to create the Footprint. On the program gcc 
for example, the total sum of differences with 32 buckets was 
72% of that captured with an infinite number of buckets. In gen- 
eral we have found that 32 buckets was enough to distinguish 
between two phases. 
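
A small sketch may help make the aliasing measurement concrete. It assumes that the "difference" between two intervals is the sum of absolute per-bucket differences, which the text does not state explicitly, and it uses a made-up three-interval profile with a hypothetical hash onto two buckets.

```python
def sequential_difference(footprints):
    """Sum of per-bucket absolute differences between each interval i and i - 1."""
    total = 0
    for prev, cur in zip(footprints, footprints[1:]):
        total += sum(abs(c - p) for c, p in zip(cur, prev))
    return total


def bucketize(block_weights, num_buckets, hash_fn):
    """Fold per-basic-block weights into a fixed number of buckets."""
    out = [0] * num_buckets
    for block, weight in enumerate(block_weights):
        out[hash_fn(block)] += weight
    return out


# Ideal case: one counter per basic block (the "infinite buckets" baseline).
per_block = [[10, 0, 0, 0],
             [0, 10, 0, 0],
             [0, 0, 10, 0]]
ideal = sequential_difference(per_block)

# Two buckets: blocks 0 and 1 alias, so the first change becomes invisible.
two_buckets = [bucketize(v, 2, lambda b: b // 2) for v in per_block]
captured = sequential_difference(two_buckets)

ratio = captured / ideal
```

Here the two-bucket table captures half of the ideal difference, the same kind of ratio that Figure 4 reports per program.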

4.5 Classifying a Footprint to a Phase ID 

After reducing the vector to form a footprint, we begin the classification
process by comparing the footprint to a set of representative
past footprint vectors. We compare the current vector
to each vector in the table. The next section details how we per- 
form the comparison and determine what a match is. If there is 
a match, we classify the profiled section of execution into the 
same phase as the past footprint vector, and the current vector 
is not inserted into the past footprint table. If there is no match, 
then we have just detected a new phase and hence must create a 
new unique phase ID into which we may classify it. This is done 
by choosing a unique phase ID out of a fixed pool of IDs. When 
allocating a new phase ID, we also allocate a new past footprint 
entry, set it to the current vector, and store with that entry the 
newly allocated phase ID. This allows future similar phases to 
be classified with the same ID. In this way only a single vector 
is kept for each unique phase ID, to serve as a representative of
that phase. After a phase ID is provided for the most recent in- 
terval, it is passed along to prediction and statistic logging, and 
the phase identification part of our algorithm is completed. 

To examine the number of phase IDs we need to track, Fig- 
ure 5 shows the percentage of execution that can be accounted 
for by the top p phases, where p is shown on the x-axis. Results
are graphed for the programs that had the min (galgel)
and max (art) coverage, gcc, gzip, and the overall average. 
These results show that most of the program's phase behavior 
can be captured using a relatively small number of phase IDs. 



Figure 5: Results of the minimum number of phases that need
to be captured versus the amount of program execution they cover.
The y-axis is the percent of program execution that is covered. 
The x-axis is the minimum number of phases needed to capture 
that much program execution. 

If we only track and optimize for the top 20 phases in each ap- 
plication, we will capture and be able to accurately apply phase 
prediction/optimizations to over 90% of the program's execution 
on average. In the worst case (min), we are able to optimize most 
of the program (over 80%) by only targeting a small number (20) 
of important recurring phases. 
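
The coverage figure can be computed from a sequence of per-interval phase IDs. The sketch below is illustrative only: the function name and the ten-interval sequence are made up, and it relies on the fact that every interval has the same length, so the fraction of intervals equals the fraction of executed instructions.

```python
from collections import Counter


def coverage_of_top_phases(phase_ids, p):
    """Fraction of intervals that fall into the p most frequent phases."""
    counts = Counter(phase_ids)
    covered = sum(n for _, n in counts.most_common(p))
    return covered / len(phase_ids)


# Hypothetical classification of ten intervals into four phases.
ids = [0] * 5 + [1] * 3 + [2] + [3]
top1 = coverage_of_top_phases(ids, 1)
top2 = coverage_of_top_phases(ids, 2)
top4 = coverage_of_top_phases(ids, 4)
```

For this sequence the single largest phase covers half of execution, the top two cover 80%, and all four cover everything.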

4.5.1 Finding a Match 

We search through the Footprint histories to find a match, but 
this query is complicated by the fact that we are not necessar- 
ily searching for an exact match. Two sections of execution that 
have very similar footprints could easily be considered a match, 
even if they do not compare exactly. To compare two vectors 
to one another, we use the Manhattan distance between the two, 
which is the element-wise sum of the absolute differences. This 
distance is used to determine if the current interval should be 
classified as the same phase ID as one of the past footprint intervals.
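
The matching and allocation steps can be sketched together in software. This is a minimal model under several assumptions that the text leaves open: a match requires the distance to be strictly below the threshold, the closest matching entry wins when more than one qualifies, phase IDs are handed out sequentially, and no limit on the table size is modeled. The class name and the sample vectors are hypothetical, and raw instruction counts are used so that the threshold is in units of instructions.

```python
def manhattan(a, b):
    """Element-wise sum of absolute differences between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


class PhaseClassifier:
    """Toy past-footprint table: one representative vector per phase ID."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.table = []  # list of (representative vector, phase ID)

    def classify(self, vector):
        best = None
        for past, phase_id in self.table:
            d = manhattan(vector, past)
            if d < self.threshold and (best is None or d < best[0]):
                best = (d, phase_id)
        if best is not None:
            return best[1]            # match: table is left unchanged
        phase_id = len(self.table)    # no match: allocate a new phase ID
        self.table.append((list(vector), phase_id))
        return phase_id


classifier = PhaseClassifier(threshold=1_000_000)
a = [5_000_000, 3_000_000, 2_000_000, 0]
b = [5_200_000, 2_900_000, 1_900_000, 0]          # 400,000 away from a
c = [1_000_000, 1_000_000, 2_000_000, 6_000_000]  # 12,000,000 away from a
ids = [classifier.classify(v) for v in (a, b, c, a)]
```

The four intervals are classified as phases 0, 0, 1, 0, and only two representative vectors end up in the table.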

If we set the distance threshold too low, the phase detection 
will be overly sensitive, and we will classify the program into 
many, very tiny phases which will cause us to lose any bene- 
fit from doing run-time phase analysis in the first place. If the 
threshold is too high, the classifier will not be able to distinguish 
between phases with different behavior. To quantify this effect, 
we examine how well our hardware technique classifies phases 
for a variety of thresholds compared to the phases found by the 
off-line clustering algorithm used in SimPoint [21]. 

The SimPoint tool is able to make global decisions to opti- 
mize the grouping of similar intervals into phases. The off-line 
algorithm makes no use of thresholds; instead, its decisions are
based solely on the structure found in the distribution of pro- 
gram behaviors. Our technique must be far more simplistic be- 
cause it must be performed on-line and with limited computa- 
tional overhead. This reduction in complexity comes at the cost 
of increased error. 

The Different Phases line in Figure 6 shows the ability of 
our hardware technique to find phase changes (transitions be- 
tween one phase and the next) when different thresholds are used 



Figure 6: Results showing how well our hardware phase tracker
classifies two sequential intervals of execution as being from
"Different" or the "Same" phase of execution. The percent of
misclassifications is shown in comparison to the phase classifications
found using the off-line clustering SimPoint tool [21].

to perform the phase classification. For example, when using a 
Manhattan distance of 1 million as our threshold (shown as 20 
on our x-axis because it is in log2), our hardware technique
identified 80% of the phase changes that occurred in the more
complex off-line SimPoint analysis. Conversely, 20% of the phase
changes were incorrectly classified as having the same phase ID 
as the last interval of execution. 

Likewise, the Same Phases line in Figure 6 represents the 
ability of our hardware technique to accurately classify two se- 
quential intervals as being part of the same phase as a function 
of different thresholds (again as compared to the off-line cluster- 
ing analysis). For example, when using a Manhattan distance of 
1 million (shown as 20 on the x-axis), our hardware technique 
identified 80% of the intervals that stayed in the same phase 
as correctly staying in the same phase, but 20% of those inter- 
vals were classified as having a different phase ID from the prior 
phase. 

A misclassification occurs when two sequential intervals of 
execution are classified as being in the same phase or in different 
phases using our hardware approach when the off-line clustering 
analysis tool found the opposite for these two intervals. 
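
This definition translates directly into a comparison over pairs of sequential intervals. The sketch below is an illustration, not the evaluation code used in the study: the function name and both ID sequences are made up, and the phase ID values themselves need not agree between the two classifiers, since only the same/different relation between neighbors is compared.

```python
def transition_misclassification(hw_ids, ref_ids):
    """Return (same-phase error rate, different-phase error rate).

    Each pair of sequential intervals is labeled Same or Different by the
    reference classifier; an error is counted when the hardware labels
    disagree with that judgment.
    """
    same_total = same_wrong = diff_total = diff_wrong = 0
    for i in range(1, len(ref_ids)):
        ref_same = ref_ids[i] == ref_ids[i - 1]
        hw_same = hw_ids[i] == hw_ids[i - 1]
        if ref_same:
            same_total += 1
            same_wrong += not hw_same
        else:
            diff_total += 1
            diff_wrong += hw_same
    same_rate = same_wrong / same_total if same_total else 0.0
    diff_rate = diff_wrong / diff_total if diff_total else 0.0
    return same_rate, diff_rate


ref = [0, 0, 0, 1, 1, 0]   # hypothetical off-line classification
hw = [5, 5, 6, 7, 7, 7]    # hypothetical hardware classification
same_rate, diff_rate = transition_misclassification(hw, ref)
```

In this example one of three Same pairs and one of two Different pairs are misclassified.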

If we are too aggressive and our hardware phase analysis in- 
dicates that there are phase changes when there are actually no 
noticeable changes in behavior, then we will create too many 
phase IDs that have similar behavior. This can create more over- 
head for performing phase-based optimization. On the other 
hand, if we are too passive in distinguishing between different
phases, we will be missing opportunities to make phase specific 
optimizations. 

In order to strike a balance between having a high capture 
rate and reducing the percent of false positives, we chose to use 
a threshold of 1 million. When comparing this with the interval 
size of 10 million instructions, this means that a difference in the 
phase behavior will be detected if 10% of the executed instruc- 
tions are in different proportions. In choosing 1 million, we have 
on average a 20% misclassification rate. Note that a misclassification
does not necessarily mean that an incorrect optimization
will be performed. For example, if we have a "Same Phase"
misclassification (the two intervals were really from the same phase,
but were classified into different phases), then a phase change is 
observed using our hardware technique when there was not one 
in the baseline classifier. If the two hardware detected phases 
have the same optimization applied to them, then this misclassification
can have no effect.

4.6 Per-Phase Performance Metric Homogeneity 

Using the techniques presented above, we can perform phase 
classification on programs at run-time with little to no impact 
on the design of the processor core. One of the goals of phase 
classification is to divide the program into a set of phases that are 
fairly homogeneous. This means that an optimization adapted 
and applied to a single segment of execution from one phase, 
will apply equally well to the other parts of the phase. In order to 
quantify the extent to which we have achieved this goal, we need 
to test the homogeneity of a variety of architectural statistics on 
a per-phase basis. 

Figure 7 shows the results of performing this analysis on the 
phases determined at run-time. Due to space constraints we only 
show results for two of the more complicated programs gcc and 
gzip. For both programs, a set of statistics for each phase is 
shown. The first phase that is listed (separated from the rest) as
full, is the result of classifying the entire program into a single 
phase. The results show that for gcc for example, the average 
IPC of the entire program was 1.32, while the average number 
of cache misses was 445,083 per ten million instructions. In 
addition to just the average value, we also show the standard 
deviation for that statistic. For example, while the average IPC 
was 1.32 for gcc, it varied with a standard deviation of over 
43% from interval to interval. If the phase-tracking hardware is 
successful in classifying the phases, the standard deviations for 
the various metrics should be low for a given phase ID. 
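
The homogeneity test amounts to grouping a per-interval statistic by phase ID and reporting the standard deviation as a percentage of the mean. The sketch below illustrates the idea with made-up IPC samples; it uses the population standard deviation, which is an assumption, since the text does not say which estimator was used.

```python
from collections import defaultdict
from statistics import mean, pstdev


def percent_deviation(values):
    """Standard deviation expressed as a percentage of the mean."""
    return 100.0 * pstdev(values) / mean(values)


def per_phase_stats(phase_ids, values):
    """Map each phase ID to (mean, standard deviation as % of mean)."""
    groups = defaultdict(list)
    for phase_id, value in zip(phase_ids, values):
        groups[phase_id].append(value)
    return {pid: (mean(vs), percent_deviation(vs)) for pid, vs in groups.items()}


# Hypothetical per-interval IPC samples for two phases.
ids = [0, 0, 1, 1]
ipc = [0.60, 0.62, 2.00, 2.04]
whole_program = percent_deviation(ipc)   # the "full" row: everything in one phase
stats = per_phase_stats(ids, ipc)
```

With these samples the whole-program deviation is above 50%, while each phase on its own stays below 2%, which is the pattern a successful classification should produce.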

Underneath the phase marked full are the five most fre- 
quently executed phases from the program as identified by our 
phase tracker. The phases are weighted by the percentage of the 
program's executed instructions they account for. For gcc, the 
largest phase accounts for 18.5% of the instructions in the entire 
program and has an average IPC of 0.61 and a standard devi- 
ation of only 1.6% (of 0.61). The other top four phases have 
standard deviations at or below this level, which means that our 
technique was successful at dividing up the execution of gcc 
into large phases with similar execution behavior with respect to 
IPC. Note, that some metrics for certain phases have a high stan- 
dard deviation, but this occurs for architecture features/metrics 
that are unimportant for that phase. For example, the phase that 
occurs for 7.2% of execution in gcc has only 75 L1 instruction
cache misses on average. This is an L1 miss rate of 0.00075%,
so an error of 215% for this metric will not likely have any effect 
on the phase. 

When we look at the energy consumption of gcc, it can be 
observed that energy consumption swings radically (a standard 
deviation of 90%) over the complete execution of the program. 
This can be seen visually in Figure 1, which plots the energy
usage versus instructions executed. However, after dividing the 
program into phases, we see that each phase has very little variation
within itself. All have less than 2% standard deviation.
By analyzing gcc it can also be seen that the phase partitioning 
does a very good job across all of the measured statistics even 
though only one metric is used. This indicates that the phases 
that we have chosen are in some way representative of the actual 
behavior of the program. 
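
The per-phase homogeneity check described above can be illustrated with a short sketch. It assumes fixed-size execution intervals (so a phase's share of intervals equals its share of instructions) and uses the population standard deviation expressed as a fraction of the mean. The function names and the sample numbers in the usage note are hypothetical and are not taken from the measurements in Figure 7.

```python
from collections import defaultdict
from statistics import mean, pstdev


def relative_stddev(values):
    """Standard deviation expressed as a fraction of the mean."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0


def per_phase_stats(phase_ids, metric):
    """Group one per-interval metric (e.g. IPC) by phase ID.

    phase_ids[i] is the phase assigned to interval i and metric[i] is the
    value measured for that interval. Intervals are assumed to be of equal
    length, so the weight of a phase is its share of the intervals.
    """
    groups = defaultdict(list)
    for pid, value in zip(phase_ids, metric):
        groups[pid].append(value)
    total = len(metric)
    return {
        pid: {
            "weight": len(vals) / total,
            "mean": mean(vals),
            "rel_std": relative_stddev(vals),
        }
        for pid, vals in groups.items()
    }
```

For example, with phase IDs [0, 0, 1, 1] and IPC values [0.6, 0.6, 2.0, 2.0], the whole-program relative deviation is large while each phase on its own shows none, which is the pattern a successful classification should produce.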

5 Phase Prediction 

The prior section described our phase tracking architecture, and 
how it can be used to classify phases. In this section we focus on 
using phase information to predict the next phase. For a variety 
of applications it is important to be able to predict future phase 
changes so that the system can configure for the code it will soon 
be executing rather than simply reacting to a change in behavior. 

Figure 8 shows the percentage of interval transitions that are
changes in phase, for our set of benchmarks. For all of these pro- 
grams, phase changes come quite often, but it should be noted 
that this statistic alone cannot gauge the complexity of the pro- 
gram behavior. The program gcc switches less than 10% of 
the time but switches between many different phases. The other 
extreme is art which switches almost half the time, but it is 
only switching between a few distinct phases. In this case, large 
repeating patterns can be observed. No two phases executing se- 
quentially are that similar, but there is an order to the sequence. 
By adding in a prediction scheme for these cases, we not only 
take advantage of stable conditions as in past research, but actu- 
ally take advantage of any repeating patterns in program behav- 
ior. 

5.1 Markov Predictor 

The prediction of phase behavior is different from many other 
systems in which hardware predictors are used. Because of this 
new environment, a new type of predictor has the potential to 
perform better than simply using predictors from other areas of 
computer architecture (branch and address prediction, memory 
disambiguation, etc.). 

After observing the way that phases change, we determined 
that two pieces of information are important. First, the set of 
phases leading up to the prediction are very important, and sec- 
ond, the duration of execution of those phases is important. 

A classic prediction model that is easily implementable in 
hardware is a Markov Model. Markov Models have been used 
in computer architecture to predict both prefetch addresses [13] 
and branches [8] in the past. The basic idea behind a Markov 
Model is that the next state of the system is related to the last set 
of states. 

The intuition behind this design is that phase information 
tends to be characterized by many sections of stable behavior 
interspersed with abrupt phase changes. The key is to be able to 
predict when these phase changes will occur, and to know ahead 
of time what phase they will change to. The problem is that the 
changes are often preceded by stable conditions, and if we only 
consider the last couple of intervals we will not be able to tell 
the difference between sections of stable behavior that precede 
a phase change, and those sections that will continue to be
stable. Instead, we need a way of compressing down stable phase
information into a piece of information that we can use as state.

Figure 7: Examination of per-phase homogeneity compared to the program as a whole (denoted by full). For the two programs
and each of the top 5 phases of each program, we show the average value of each metric and the standard deviation. The name
of the phase is the percent of execution that it accounts for in terms of instructions. These results show that after dividing up the
program into phases using our run-time scheme the behavior within each phase is quite consistent.

Figure 8: The percent of execution intervals that transition to
a different phase from the prior execution interval's phase as
found by our phase tracking architecture with 32 footprint
counters using a 1 million Manhattan threshold.

5.2 Run Length Encoding Markov Predictor 

To compress the stable state we use a Run Length Encoding 
(RLE) Markov predictor. The basic idea behind the predictor 
is that it uses a run-length encoded version of the history to in- 
dex into a prediction table. The index into the prediction table is 
a hash of the phase identifier and the number of times the phase 
identifier has occurred in a row. 

Figure 9 shows our RLE Markov Phase ID prediction
architecture. The lower order bits of the hash function provide an
index into the prediction table, and the higher order bits of the
hash function provide a tag. When there is a tag match, the phase
ID stored in the table provides a prediction as to the next phase
to occur in execution. When there is a tag miss, the prior phase
ID is assumed to be the next phase ID to occur in the program's
execution. We found predicting the last phase ID to be 75%
accurate on average.

We only update the predictor when there is (1) a change in
the phase ID, or (2) a tag match. We only insert an
entry when there is a phase ID change, since we want to predict
when the phase is going to change. Execution intervals where
the same phase ID occurs several times in a row do not need
to be stored in the table, since they will be correctly predicted
as "last phase ID" when there is a table miss. This helps
table capacity constraints and avoids polluting the table with last
phase predictions. For the second update case, when there is
a tag match, we update the predictor because the observed run
length may have changed.

Figure 9: Phase Prediction Architecture for the Run Length
Encoded (RLE) Markov predictor. The basic idea behind the
predictor is that two pieces of information are used to generate the
prediction: the phase ID that was just seen, and the number of
times prior to now that it has been seen in a row. The index into
the prediction table is a hash of these two numbers.
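
The lookup and update rules of the RLE Markov predictor can be sketched in software as follows. This is only an illustration of the policy described in this section, not the hardware design. The hash function, the 16-bit hash width, and all names are assumptions made for the example, and phase IDs are assumed to be small non-negative integers.

```python
class RLEMarkovPredictor:
    """Software sketch of a run-length-encoded Markov phase predictor."""

    def __init__(self, index_bits=8):
        self.index_bits = index_bits
        # Each slot holds (tag, next_phase) or None; 2**8 = 256 entries.
        self.table = [None] * (1 << index_bits)
        self.phase = None  # phase ID of the most recent interval
        self.run = 0       # number of times in a row it has been seen

    def _key(self):
        # Hypothetical hash of (phase ID, run length). The low-order bits
        # select a table slot and the high-order bits form the tag.
        h = (self.phase * 257 + self.run * 17) & 0xFFFF
        return h & ((1 << self.index_bits) - 1), h >> self.index_bits

    def predict(self):
        """Predicted phase ID for the next interval (None before any history)."""
        if self.phase is None:
            return None
        index, tag = self._key()
        entry = self.table[index]
        if entry is not None and entry[0] == tag:
            return entry[1]   # tag match: use the stored next phase
        return self.phase     # tag miss: fall back to "last phase ID"

    def update(self, actual):
        """Record the phase ID actually observed for the interval."""
        if self.phase is not None:
            index, tag = self._key()
            entry = self.table[index]
            tag_match = entry is not None and entry[0] == tag
            # Write on a phase change (insert) or on a tag match (refresh,
            # since the observed run length may have changed).
            if actual != self.phase or tag_match:
                self.table[index] = (tag, actual)
        if actual == self.phase:
            self.run += 1
        else:
            self.phase, self.run = actual, 1


def count_mispredictions(phase_ids, predictor):
    """Replay a phase ID trace and count wrong predictions."""
    misses = 0
    for actual in phase_ids:
        guess = predictor.predict()
        if guess is not None and guess != actual:
            misses += 1
        predictor.update(actual)
    return misses
```

On a repeating trace such as three intervals of phase 1 followed by two intervals of phase 2, this sketch mispredicts only the first occurrence of each transition. Later repetitions hit in the table because the (phase ID, run length) pair has been seen before.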

5.3 Predictor Comparison 

We compare our RLE Markov phase predictor with other pre- 
diction schemes in Figure 10. This Figure has four bars for ev- 
ery program, and each bar corresponds to the prediction accu- 
racy of a prediction architecture. The first and simplest scheme, 
Last Phase, simply predicts that the next phase is the same as 
the current phase, in essence always predicting stable operation. 
The prediction accuracy of this scheme is inversely proportional 
to the rate at which phases change in a given benchmark. For 
the program gzip for example, there are long periods of execu- 
tion where the phase does not change, and therefore predicting 
no-change does exceptionally well. 

In order to ensure that we were not simply providing an
expensive filter for noise in the phase IDs, we also compared
against a simple noise filter which works by predicting that the
next phase will be the most commonly occurring of the last three
phases seen. This is not shown, as it actually performed worse
on all of the programs.

Figure 10: Phase ID Prediction Accuracy. This figure shows
how well different prediction schemes work. The most naive
scheme, last, simply predicts that the phases never change.
The bars marked Markov and RLE Markov show how well
we can predict the phase identifiers if we use a Markov
prediction scheme with a Markov table size of 256 entries.
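
For reference, the simple noise filter mentioned above could look like the following sketch. The text does not say how ties are resolved when the last three phases are all different, so this example assumes the most recent phase wins. The function name is hypothetical.

```python
from collections import Counter


def majority_of_last_three(history):
    """Predict the most common phase ID among the last three seen.

    Ties are broken in favor of the most recent phase ID; that tie-break
    is an assumption of this sketch.
    """
    if not history:
        return None
    recent = history[-3:]
    counts = Counter(recent)
    best = max(counts.values())
    for phase_id in reversed(recent):
        if counts[phase_id] == best:
            return phase_id
```

A filter of this kind suppresses a single stray phase ID, but it also delays the reaction to a real phase change by one interval.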

Additionally we wanted to examine the effect of a simple 
Markov model predictor for history lengths of 1 and 2. The 
Markov model predictor does a better job of predicting phase 
transitions than Last Phase, but it is limited by the fact that 
long runs will always be predicted as infinitely stable due to the 
history filling up. However, it is still very effective for facerec
and applu, but does not provide much benefit for either art or 
galgel. 

The final bar, RLE Markov, is our improved Markov pre- 
dictor which compresses stable phases into a tuple of phase 
id and duration. All of the Markov predictors simulated had 
256 entries taking up less than 500 bytes of storage. Using 
RLE Markov outperforms both the Last Phase and tradi- 
tional Markov on all the benchmarks. It performs especially 
well compared to other schemes on both applu and art. Over- 
all, using a Run-Length Encoded Markov predictor can cut the 
phase mispredictions down to 14% on average. 

6 Applications 

This section examines three optimization areas in which a phase- 
aware architecture can provide an advantage. We begin by ex- 
amining the relationship between phase behavior and value lo- 
cality. We then demonstrate ways to reduce processor energy 
consumption by adjusting the aggressiveness of the data cache 
and the instruction front end. 

6.1 Frequent Value Locality 

Prior work on value predictors has shown that there is a great 
deal of value locality in a variety of programs [14, 7]. Recently, 
researchers have started to take advantage of frequently loaded
values for the purpose of optimizing caches. For example, Yang
and Gupta [22] proposed a data cache organization that com- 
presses the most frequently used program values in order to save 
energy. Another way of exploiting value locality is through value 
specialization, which can be done either statically or dynami- 
cally [6, 17, 16] to create specialized versions of procedures or 
code-regions based upon the values frequently seen. These tech- 
niques are built on the idea of finding the most frequent values 
for loads over the whole program, and then specializing the pro- 
gram to those frequent values. 

We examine the potential of capturing frequent values on a 
per-phase basis and compare this to the frequent values aggre- 
gated over the entire program, as would be used in value code 
specialization [6]. To perform this experiment we first gathered 
the top 16 values that were loaded over the complete execution 
of the program and stored them into a table. We then examined 
the percentage of executed loads that found their loaded value in 
this table. This result is shown as Static in Figure 11. While 
significant portions of some programs are covered by just these 
few top values (such as applu), over half of the programs have 
less than 10% of their loaded values covered by these top values. 

The question is: can we do better by exploiting hardware- 
detected phase information? To answer this question we take the 
top 16 values for each phase, as detected by the hardware phase 
tracker. These top values will be shared across a single phase 
even if it is split into two or more different sections of execution. 
Each load in the program is then checked against the top val- 
ues for its corresponding phase. The Phase Coverage bar 
in Figure 11 shows the percent of all load values in the program
that were successfully matched to its per-phase top value set.

Without any notion of loads or values, our method of dividing
up phases is very successful at assisting in the search for
frequent values. By just tracking the top 16 values of each phase,
we are able to capture the values from almost 50% of the
executed loads on average. The Perfect bar shows the percentage of
loads covered if one captures the top 16 load values for each and
every interval (i.e., 10 million instructions) separately. This is in
effect the best that we could hope to achieve for an interval size
of 10 million instructions, because the 16 entries in the value
table are custom crafted for each interval individually. As shown
in Figure 11, the phase-tracker compares favorably with the
optimal coverage. Two thirds of the total possible benefit from
per-interval value locality can be captured by per-phase value
locality. It is important to point out that this graph by itself is not
a good indicator of usefulness, as near perfect coverage could
be achieved simply by making every interval a separate phase.
However, as shown in Figure 5, only a few phases (around 20)
are used to cover at least 80% of the program's execution.
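
The difference between whole-program (Static) and per-phase coverage can be made concrete with a small sketch. It assumes the load trace is available as a list of (phase ID, loaded value) pairs. The function names are hypothetical, and the table size k defaults to the 16 entries used in the experiment.

```python
from collections import Counter, defaultdict


def top_values(values, k=16):
    """Set of the k most frequently occurring values."""
    return {value for value, _ in Counter(values).most_common(k)}


def static_coverage(loads, k=16):
    """Fraction of loads whose value is in one whole-program top-k table."""
    table = top_values([value for _, value in loads], k)
    return sum(value in table for _, value in loads) / len(loads)


def phase_coverage(loads, k=16):
    """Fraction of loads whose value is in the top-k table of their phase."""
    by_phase = defaultdict(list)
    for phase_id, value in loads:
        by_phase[phase_id].append(value)
    tables = {pid: top_values(vals, k) for pid, vals in by_phase.items()}
    return sum(value in tables[pid] for pid, value in loads) / len(loads)
```

With a one-entry table and the trace [(0, 1), (0, 1), (0, 1), (1, 2), (1, 2)], the static table covers three of the five loads, while the per-phase tables cover all five.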

6.2 Dynamic Data Cache Size Adaptation

In a modern processor a significant amount of energy is
consumed by the data cache, but this energy may not be put to
good use if an application is not accessing large amounts of data
with high locality. To address this potential inefficiency,
previous work has examined the potential of dynamically
reconfiguring the data caches with the intention of saving power. In [2],
Balasubramonian et al. present two different schemes with
which re-configuration may be guided. In one scheme,
hardware performance counters are read by re-configuration software
every hundred thousand cycles. The software then makes a
decision based on the values of the counters. In another scheme,
re-configuration decisions are performed on procedure
boundaries instead of at fixed intervals. To reduce the overhead of
re-configuration, software to trigger re-configuration is only placed
before procedures that account for more than a certain
percentage of execution.

Figure 11: The percent of the program's load values that are
found in a table of the most frequently loaded values over the
whole program (Static Coverage), on a per-phase basis (Phase
Coverage), and on a per execution interval basis (Optimal
Coverage).

Another form of re-configurable cache that has been pro- 
posed dynamically divides the data cache into multiple parti- 
tions, each of which can be used for a different function such 
as instruction reuse buffers, value predictors, etc [18]. These 
techniques can be triggered at different points in program exe- 
cution including procedure boundaries and fixed intervals. The 
overhead of re-configuration can be quite large and making these 
policy decisions only when the large scale program behavior 
changes, as indicated by phase shifts in our hardware tracker, 
can minimize overhead while guaranteeing adequate sensitivity 
to attain maximum benefit. 

We examined the use of phase tracking hardware to guide an 
energy aware, re-sizable cache. The energy consumption of the 
data cache can be reduced by dynamically shifting to a smaller, 
less associative cache configuration for program phases that do 
not benefit significantly from more aggressive cache configura- 
tions. By targeting only those phases that are predicted to have 
energy savings due to cache size reduction, our scheme is able 
to reduce power with very little impact on the performance. 

We examined an architecture with two possible cache con- 
figurations, 32KB 4-way associative and 8KB direct mapped. In 
Figure 12, the trade off between these two configurations is plot- 
ted. For each program, we use the 32KB cache configuration as 
the baseline result. The labeled circles in Figure 12 show the 
total processor energy savings and performance degradation for 
each program if only the smaller (8KB) cache size is used. For 
example, a processor with a smaller cache configuration for the 
program applu is both 5% slower and uses 5% less energy. 
Two programs, vpr and apsi, actually use more energy with a
smaller cache due to large slow downs. These two points are off
the scale of this graph and are not shown.

Figure 12: Data Cache Re-configuration. The tradeoff between
energy savings and slowdown for two different cache policies.
All results are relative to a 32KB 4-way associative cache. The
circles in the graph (each labeled with a number for the program
the data point is from) show the energy and performance of an
8KB direct mapped cache. The triangles show the tradeoff of
intelligently switching between an 8KB direct mapped and a 32KB
4-way data cache based on phase classification and prediction.

While examining energy savings and slow down is interest- 
ing, it is important to note that there is more than one way to 
reduce both energy and performance. Voltage scaling in particu- 
lar has proven to be a technology capable of reaping large energy 
savings for a relative reduction in performance. For our results, 
we assume that for voltage scaling a performance degradation of 
5% will yield an approximate energy saving of 15%. We use this 
rule of thumb as our guideline for determining when to reduce 
the active size of the cache. In Figure 12, this simple model of 
voltage scaling is plotted as a dashed line. When the cache size 
is reduced, most programs fall far short of this baseline, meaning 
that voltage scaling would provide a better performance-energy 
tradeoff. There are a couple of exceptions, in particular mcf, 
bzip, and gzip do well even without any sort of phase-based 
re-configuration. 

The shaded triangles in Figure 12 show what happens if 
we use phase classification and prediction to guide our re- 
configuration. When a new phase ID is seen, we sample the IPC 
and energy used for a few intervals using the 32KB 4-way cache, 
and a few intervals for the 8KB direct mapped cache. These sam- 
ples could be kept in a small hardware profiling table associated 
with the phase ID. After taking these samples, if we find that a 
particular phase is able to achieve more than three times the en- 
ergy savings relative to the slow down seen when using the 8KB 
cache, we then predict for this phase ID that the smaller cache 
size should be used. This heuristic means that the small cache 
size is used only if re-configuration would beat voltage scaling 
for that phase. After a decision has been made as to the
configuration to use for a phase ID, the corresponding cache size is
stored in the phase profiling table/database associated with that
phase ID. The phase classifier and predictor are then used to
predict when a phase change occurs. When a phase change
prediction occurs, the predicted phase ID looks up the cache size in the
profiling table, and re-configures the cache (if it is not already
that size) at the predicted phase change.

Figure 13: Processor Width Adaptation. The tradeoff between
energy savings and slowdown for two different front end
policies. All results are relative to an aggressive 8-issue machine.
The circles in the graph (each labeled with a number for the
program) show the energy and performance of a less aggressive
2-issue processor. The triangles show using the phase classifier
and predictor for switching between 2-issue and 8-issue based
on phase changes.
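
The per-phase cache decision can be sketched as below. The 3:1 ratio comes from the voltage scaling rule of thumb above (a 5% performance degradation for roughly 15% energy savings). How slowdown and energy savings are derived from the sampled IPC and energy is an assumption of this example, as are the function and constant names.

```python
LARGE = "32KB 4-way"
SMALL = "8KB direct mapped"


def choose_cache(ipc_large, energy_large, ipc_small, energy_small, ratio=3.0):
    """Pick the cache configuration to record for one phase ID.

    Inputs are averages over the sampled intervals for each configuration.
    Slowdown is taken as the relative increase in execution time and
    savings as the relative reduction in energy (both are assumptions).
    """
    slowdown = ipc_large / ipc_small - 1.0
    savings = 1.0 - energy_small / energy_large
    if savings <= 0.0:
        return LARGE   # the small cache saves nothing for this phase
    if slowdown <= 0.0:
        return SMALL   # energy saved at no performance cost
    # Use the small cache only if it beats the voltage scaling guideline.
    return SMALL if savings > ratio * slowdown else LARGE


def cache_for_predicted_phase(profiling_table, predicted_phase_id):
    """Look up the stored decision; untrained phases keep the large cache."""
    return profiling_table.get(predicted_phase_id, LARGE)
```

Under these assumptions, a phase that runs about 5% slower and saves 20% energy with the small cache would be assigned the 8KB configuration, while one that saves only 10% for the same slowdown would stay on the 32KB configuration.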

For all programs, our re-configuration is able to beat 
or tie voltage scaling. For example, using phase-based re- 
configuration results in a slowdown of 0.5% for applu, while 
the total energy savings is 4.5%. Even the program apsi, which 
had increased energy consumption in the small cache configura- 
tion, is able to get almost 5% energy savings with only a 1% 
slowdown. 

6.3 Dynamic Processor Width Adaptation 

One way to reduce the energy consumption in a processor is to 
reduce the number of instructions entering the pipeline every cy- 
cle [12, 1]. We call this adjusting the width of the processor. 
Reducing the width of the processor reduces the demand on the 
fetch, decode, functional units, and issue logic. Certain phases 
can have a high degree of instruction level parallelism, whereas 
other phases have a very low degree. Take for example the top 
two phases for gcc shown in Figure 7. The intervals classified 
to be in the first phase consisting of 18.5% of execution have an
IPC of 0.61 with a high data cache miss rate. In comparison, 
the intervals in the second most frequently encountered phase 
(accounting for 18.1% of execution) have an IPC of 1.95 and 
very low data cache miss rates. We can potentially save energy 
without hurting performance by throttling back the width of the 
processor for phases that have low IPC, while still using aggres- 
sive widths for phases with high IPC. 



In the current literature, decisions to reduce or increase the 
fetch/decode/issue bandwidth of the processor are made either 
at fixed intervals (relatively short intervals such as 1,000 cy- 
cles) [12] or, as in the case of branch confidence based schemes, 
when a branch instruction is fetched [1]. It can be very difficult to
design real systems that save energy by reconfiguring at these 
speeds, but a hardware phase-tracker can help make these deci- 
sions at a coarser granularity while still maintaining performance 
and energy benefits. 

We examined an architecture that could be configured with 2 
different widths - one where up to 2 instructions are decoded and 
up to 2 issued per cycle, and one where up to 8 instructions are 
decoded and up to 8 issued per cycle. When a new phase ID is 
seen by the phase tracker, we sample the IPC for three intervals 
with a width of 2 instructions, and three intervals with a width 
of 8 instructions. If there is little difference in the IPC between 
these two widths, then we assign a width of 2 instructions to this 
Phase ID in our profiling table, otherwise we assign a width of 
8 instructions. During execution, we use the phase ID predictor 
to effectively predict the width for the next interval of execution 
and adjust the processor's width accordingly. Our results show 
that the chosen configuration for a given phase can be trained 
(1) with only a few samples, and (2) only once to accurately 
represent the behavior of a given phase ID. This requires very 
little training time due to the fact that 20 or fewer phase IDs 
are needed to capture 80% or more of a program's execution as 
shown in Figure 5. 
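
A minimal sketch of the width training step follows. The text only says that a width of 2 is chosen when there is "little difference" in IPC, so the 5% tolerance used here is an assumption, as are the function names.

```python
from statistics import mean


def choose_width(ipc_width2_samples, ipc_width8_samples, tolerance=0.05):
    """Pick the issue width to record for a newly seen phase ID.

    Each argument holds the IPC measured over the sampled intervals at
    that width. The tolerance defining "little difference" is hypothetical.
    """
    narrow = mean(ipc_width2_samples)
    wide = mean(ipc_width8_samples)
    return 2 if wide - narrow <= tolerance * wide else 8


def width_for_next_interval(profiling_table, predicted_phase_id):
    """Width for the predicted phase; untrained phases default to 8."""
    return profiling_table.get(predicted_phase_id, 8)
```

A phase with low IPC at either width, such as the 0.61 IPC phase of gcc, would be trained to a width of 2, whereas a phase that reaches a much higher IPC only on the wide configuration would keep a width of 8.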

Figure 13 is a graph of the results seen when applying phase- 
directed width re-configuration. The white circles in the graph 
show the behavior of running the programs on only a 2-wide 
machine relative to the more aggressive 8-wide machine. The 
dotted line again shows what could potentially be achieved if
voltage scaling was used. While mcf and art save a lot of
energy with little performance degradation on a 2-wide machine,
the other programs do not fare as well. The program apsi, for
example, has a slowdown of over 22% with an energy savings of 
around 30%. This does not compare favorably to voltage scal- 
ing (as discussed in Section 6.2). On the other hand if we use 
phase-directed width throttling on apsi, a total processor en- 
ergy savings of 18% can be achieved with only 2.2% slowdown. 

For all of the programs we examined, with one exception, 
the slowdown due to phase aware width throttling was less than 
4%, while the average energy savings was 19.6%. This result 
demonstrates that there is significant benefit to be had in the re- 
configuration of processor front end resources even at very large 
granularities. In the worst case, this will mean a re-configuration 
every 10 million instructions, and on average every 70 million 
instructions. This should be designable even under conservative 
assumptions. 

7 Summary 

In this paper we present an efficient run-time phase tracking ar- 
chitecture that is based on detecting changes in the code being 
executed. This is accomplished by dividing up all instructions 
seen into a set of buckets based on branch PCs. This way we
approximate the effect of taking a random projection of the basic
block vector, which was shown in [21] to be an effective method
of identifying phases in programs.

Using our phase classification architecture with less than 500
bytes of on-chip memory, we show that for most programs, a
significant amount of the program (over 80%) is covered by 20 or
fewer distinct phases. Furthermore, we show that these phases,
while being distinct from one another, have fairly uniform be- 
havior within a phase, meaning that most optimizations applied 
to one phase will work well on all intervals in that phase. In the 
program gcc, the IPC attained by the processor on average over 
the full run of execution is 1.32, but has a standard deviation 
of more than 43%. By dividing it up into different phases, we 
achieve much more stable behavior, with IPCs ranging between 
0.61 and 1.95, but now with standard deviations of less than 2%. 

In addition to this, we present a novel phase prediction archi- 
tecture using a Run Length Encoding Markov predictor that can 
predict not only when a phase change is about to occur, but also
the phase ID to which it will transition. In using this design, which
also uses less than 500 bytes of storage, we achieve a phase 
prediction miss rate of 10% for applu and 4% for apsi. In 
comparison, always predicting that the phase will stay the same 
results in a miss rate of 40% and 12% respectively. 

We also examined using our phase tracking and prediction 
architecture to enable new phase-directed optimizations. Tra- 
ditional architecture and software optimizations are targeted at 
the average or aggregate behavior of a program. In comparison, 
phase-directed optimizations aim at optimizing a program's per- 
formance tailored to the different phases in a program. In this pa- 
per, we examined using phase tracking and prediction to increase 
frequent value profiling coverage, and to provide energy savings 
through data cache and processor width re-configuration. 

We believe our phase tracking and prediction design will 
open the door for a new class of run-time optimization that tar- 
gets large scale program behavior. Even though we present a 
hardware implementation for phase tracking, a similar design 
can be implemented in software to perform phase classification 
for run-time optimizers, just-in-time compilation systems, and 
operating systems. Hardware and software optimizations that 
can potentially benefit the most from phase classification and 
prediction are (1) those that need expensive profiling/training
before applying an optimization, (2) those where the time or 
cost it takes to perform the optimization is either slow or ex- 
pensive, and (3) those that can benefit from specialization where 
they have the same code/data being used differently in different
phases of execution. By using our dynamic phase tracking and 
prediction design, phase-behavior can be characterized and pre- 
dicted at the largest of scales, providing a unified mechanism for 
phase-directed optimization. 

Joel and Rich Wolski; May 31, 2002
o A Modular Scheduling Approach for Grid Application Development Environments, CS2002-0708, Holly Dail, Henri Casanova and Fran Berman; June 5, 2002
o JBIG Compression Algorithms for "Dummy Fill" VLSI Layout Data, CS2002-0709, Robert Ellis, Andrew Kahng and Yuhong Zheng; June 14, 2002
o Phase Tracking and Prediction, CS2002-0710, Timothy Sherwood, Suleyman Sair and Brad Calder; June 23, 2002
o Optimized Trace Binaries for Architectural Evaluation, CS2002-0711, Suleyman Sair, Yuanfang Hu, Timothy Sherwood and Brad Calder; June 23, 2002
o Pointer Cache Assisted Speculative Precomputation, CS2002-0712, Jamison Collins, Suleyman Sair, Brad Calder and Dean Tullsen; June 23, 2002
o ActiveCampus - Sustaining Educational Communities through Mobile Technology, CS2002-0714, William G. Griswold, Robert Boyer, Steven W. Brown, Tan Minh Truong, Ezekiel Bhasker, Gregory R. Jay and R. Benjamin Shapiro; July 8, 2002
o The ActiveClass Project: Experiments in Encouraging Classroom Participation, CS2002-0715, Tan Minh Truong, William G. Griswold, Matthew Ratto and Susan Leigh Star; July 8, 2002
o Learning the k in k-means, CS2002-0716, Greg Hamerly and Charles Elkan; July 30, 2002
o The Y-architecture: Yet Another On-Chip Interconnect Solution, CS2002-0717, Hongyu Chen, Feng Zhou and Chung-Kuan Cheng; August 7, 2002
o Fast and Scalable Conflict Detection for Packet Classifiers, CS2002-0718, Florin Baboescu and George Varghese; August 7, 2002
o Packet Classification for Core Routers: Is there an alternatives to CAMs?, CS2002-0719, Florin Baboescu, Sumeet Singh and George Varghese; August 7, 2002
o Resource Allocation for Steerable Parallel Parameter Searches: an Experimental Study, CS2002-0720, Marcio Faerman, Adam Birnbaum, Henri Casanova and Fran Berman; August 18, 2002
o A Multi-Round Algorithm for Scheduling Divisible Workload Applications: Analysis and Experimental Evaluation, CS2002-0721, Yang Yang and Henri Casanova; September 26, 2002
o Synchronous Consensus for Dependent Process Failures, CS2002-0722, Flavio Junqueira and Keith Marzullo; October 3, 2002
o Coping with Dependent Process Failures, CS2002-0723, Flavio Junqueira, Keith Marzullo and M. Voelker Geoffrey; October 7, 2002
o Using Mobile Technology to Create Opportunistic Interactions on a University Campus, CS2002-0724, William G. Griswold, Robert Boyer, Steven W. Brown, Tan Minh Truong, Ezekiel Bhasker, Gregory R. Jay and R. Benjamin Shapiro; October 16, 2002
o Group Membership and Wide-Area Master-Worker Computations, CS2002-0725, Kjetil Jacobsen, Xianan Zhang and Keith Marzullo; November 6, 2002
o Replication Strategies for Highly Available Peer-to-peer Storage Systems, CS2002-0726, Ranjita Bhagwan, Stefan Savage and Geoffrey M. Voelker; November 6, 2002
o Using SimPoints in Diverse Simulation Environments, CS2002-0727, Erez Perelman, Michael Van Biesbrouk, Timothy Sherwood and Brad Calder; November 16, 2002
o Interaction of Virtual Machine with the Operating System, CS2002-0728, Kiran Tati and Geoffrey M. Voelker; December 2, 2002
o Whole Page Performance, CS2002-0729, Leeann Bent and Geoffrey M. Voelker; December 16, 2002
o HYPERCUTS: A Decision Tree Based Algorithm for Fast Packet Classification, CS2002-0730, Sumeet Singh, George Varghese and Florin Baboescu; December 12, 2002
o Security in the Sanctuary System, CS2002-0731, Matthew Hohlfeld, Aditya Ojha and Bennet Yee; December 20, 2002

• 2003 ~ 

a The Phoenix Recovery System: Rebuilding from the ashes of nn Internet catastrophe. CS2003-0732. Flavio Junqueira, Ranjila Bhagwan, Keith Marzullo, Stefan 

Savage and Geoffrey M. Voelker; January 13, 2003 
a Connectivity in the South American Inlemel. CS2QD3-0733, Flavio Junqueira and Renata Teixcfra; January 13. 2003 

o Luwcr Bound on the Number of Rounds for Consensus with Dependent Process Failures- CS20D3-0734, Flavio Junqueira and Keith Marzullo; January 13. 2003 

□ MP1 Process Swapping: Architecture and Experimental Verification. CS2003-0735, Otto Sievert and Henri Casanova; January 29, 2003 

u Packet Classification Using Multidimensional Cutting. CS2003-O736, Sumeet Singh, Florin Baboescu, George Varghese and Jia Wang; February 7, 2003 
o Consensus for Dependent Process Failures, CS2003-O737, Flavio Junqueira ond Keith Marzullo; February 1 8, 2003 

d Bitmap algorithms for counting active mows on high speed links. CS2003-073B, Crislian Estan, George Varghese and Mike Fisk; March 13, 2003 
o Query Set Speclflcntipn Language (QS5LV CS2003-0739, Michalis Petropoulos, Alin Deutsch and Yonnis Papakonslnntinou; March 24, 2003 
a Incrcasinu Object Visibiljiy In Dcconlnilized Unstructured Pecr-To-Pccr Ndworks Usini; Conlenl Based Routing. CS2O03-O740, Sumeel Singh ond Florin " 
Baboescu; March 28, 2003 

o Proof orCornictness for Sparse Tilinu ufQauss-Scidel. CS2003-074 1 , Michelle Mills Slrout. Larry Carter and Jeanne Ferrante; April 1 , 2003 

o The case for ISP deployment ol'surjcr-pccrs in P2P networks. CS2003-0742, Sumeet Singh, Sriram Rnmabhadran. Florin Baboescu and Alex Snoeren; April IS, 
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2003 

o On the Generalization of n>k't. CS2003-0743, Flavio Junqueira and Keith Marzullo; April 21, 2003

o A Flow-based Task Scheduling Strategy for Distributed Systems. CS2003-0744, Sagnik Nandy, Jeanne Ferrante and Larry Carter; May 2, 2003
o Real-time Detection of Known and Unknown Worms. CS2003-0745, Sumeet Singh, Cristian Estan, George Varghese and Stefan Savage; May 30, 2003
o Automatically Inferring Patterns of Resource Consumption in Network Traffic. CS2003-0746, Cristian Estan, Stefan Savage and George Varghese; June 2, 2003

o The Measurement Manifesto. CS2003-0747, George Varghese and Cristian Estan; June 4, 2003

o Online Load Balancing and First-Hop Bandwidth Allocation in Public-Area Wireless Networks. CS2003-0748, Anand Balachandran, Sagnik Nandy, Venkat P. Rangan and Geoffrey M. Voelker; June 10, 2003
o The Impact of Address Allocation and Routing on the Structure and Implementation of Routing Tables. CS2003-0749, Harsha Narayan, Ramesh Govindan and George Varghese; June 19, 2003

o ActiveCampus - Experiments in Community-Oriented Ubiquitous Computing. CS2003-0750, William G. Griswold, Patricia Shanahan, Steven W. Brown, Robert Boyer, Matt Ratto, R. Benjamin Shapiro and Tan Minh Truong; June 24, 2003
o Bucking Free-Riders: Distributed Accounting and Settlement in Peer-to-Peer Networks. CS2003-0751, Abhishek Agrawal, Douglas Brown, Aditya Ojha and Stefan Savage; June 24, 2003
o Application-Tuned Processor Architectures. CS2003-0752, Timothy Sherwood; June 25, 2003

o Predictor-Directed Data Prefetching for Pointer-based Applications. CS2003-0753, Suleyman Sair; June 25, 2003

o Extensions to the Multi-Installment Algorithm: Affine Costs and Output Data Transfers. CS2003-0754, Yang Yang and Henri Casanova; July 16, 2003
o Cone: Augmenting DHTs to Support Distributed Resource Discovery. CS2003-0755, Ranjita Bhagwan, George Varghese and Geoffrey M. Voelker; July 21, 2003

o A Multiple Level Network Approach for Clock Skew Minimization with Process Variations. CS2003-0756, Makoto Mori, Hongyu Chen, Bo Yao and Chung-Kuan Cheng; July 28, 2003
o Structures and Algorithms for Phase Classification. CS2003-0757, Jeremy Lau, Stefan Schoenmackers and Brad Calder; July 29, 2003

o GRYD: Generalized Reduced-Order Wye-Delta Transformation: Programmer's Manual for Reduction Engine and Applications. CS2003-0758, Zhanhai Qin and Chung-Kuan Cheng; July 31, 2003

o GRYD: Generalized Reduced-Order Wye-Delta Transformation: User's Manual for Reduction Engine and Applications. CS2003-0759, Zhanhai Qin and Chung-Kuan Cheng; July 31, 2003
o Benchmark Probes for Grid Assessment. CS2003-0760, Greg Chun, Holly Dail, Henri Casanova and Allan Snavely; August 1, 2003

o The EarlyBird System for Real-time Detection of Unknown Worms. CS2003-0761, Sumeet Singh, Cristian Estan, George Varghese and Stefan Savage; August 4, 2003

o Segmentation by Example. CS2003-0762, Sameer Agarwal and Serge Belongie; August 14, 2003

o Three Brown Mice: See How They Run. CS2003-0763, Kristin Branson, Vincent Rabaud and Serge Belongie; August 19, 2003
o Approximation Methods for Thin Plate Spline Mappings and Principal Warps. CS2003-0764, Gianluca Donato and Serge Belongie; September 4, 2003
o Employing User Feedback for Fast, Accurate, Low-Maintenance Geolocationing. CS2003-0765, Ezekiel S. Bhasker, Steven W. Brown and William G. Griswold; September 5, 2003

o An Adaptive System for Real-time Summaries of Internet Traffic. CS2003-0766, Cristian Estan, Ken Keys and David Moore; September 24, 2003
o Structure from Periodic Motion. CS2003-0767, Serge Belongie and Josh Wills; October 10, 2003

o A Feature-based Approach for Determining Dense Long Range Correspondences. CS2003-0768, Josh Wills and Serge Belongie; October 20, 2003

o Characterizing and Evaluating Desktop Grids: An Empirical Study. CS2003-0769, Derrick Kondo, Michela Taufer, John Karanicolas, Charles L. Brooks, Henri Casanova and Andrew Chien; October 22, 2003
o DGMonitor: a Performance Monitoring Tool for Sandbox-based Desktop Grid Platforms. CS2003-0770, Pietro Cicotti, Michela Taufer and Andrew Chien; October 24, 2003

o A Co-Phase Matrix to Guide Simultaneous Multithreading Simulation. CS2003-0771, Michael Van Biesbrouck, Timothy Sherwood and Brad Calder; October 28, 2003

o Structures for Phase Classification. CS2003-0772, Jeremy Lau, Stefan Schoenmackers and Brad Calder; October 28, 2003

o The Entropia Virtual Machine for Desktop Grids. CS2003-0773, Brad Calder, Andrew Chien, Ju Wang and Don Yang; October 28, 2003

o Code Pointer Protection From Buffer Overflow Through Targeted Hardware Encryption. CS2003-0774, Nathan Tuck, Brad Calder and George Varghese; December 1, 2003

■ 2004

o One Dimensional Knapsack. CS2004-0775, T. C. Hu, M. T. Shing and Leo Landa; January 14, 2004

o Using Network Flow Buffering to Improve Performance of Video over HTTP. CS2004-0776, Jesse Steinberg and Joseph Pasquale; January 14, 2004

o A Near-Optimal Algorithm for a Locality-Maximizing Placement Problem. CS2004-0777, Fan Chung, Ronald Graham, Ranjita Bhagwan, Stefan Savage and Geoffrey M. Voelker; January 16, 2004
o Tele-Reality for the Rest of Us. CS2004-0778, Neil McCurdy and William Griswold; January 16, 2004

o Critical Points for Interactive Schema Matching. CS2004-0779, Guilian Wang, Joseph Goguen, Young-Kwang Nam and Kai Lin; January 30, 2004

o Access and Mobility of Wireless PDA Users. CS2004-0780, Marvin McNett and Geoffrey M. Voelker; February 9, 2004

o Building a Hierarchy of Variable Length Intervals to Capture Hierarchical Phase Behavior. CS2004-0781, Jeremy Lau, Erez Perelman, Greg Hamerly, Timothy Sherwood and Brad Calder; March 13, 2004
o Cone: A Distributed Heap Approach to Resource Selection. CS2004-0782, Ranjita Bhagwan, Priya Mahadevan, George Varghese and Geoffrey M. Voelker; March 22, 2004

o Optimizing the Knapsack Problem. CS2004-0783, Leo Landa; April 2, 2004

o Comparison between multistage filters and sketches for finding heavy hitters. CS2004-0784, Cristian Estan; April 27, 2004

o APST-DV: Divisible Load Scheduling and Deployment on the Grid. CS2004-0785, Krijn van der Raadt, Yang Yang and Henri Casanova; April 28, 2004
o OptIPuter System Software Framework. CS2004-0786, Xinran (Ryan) Wu, Andrew A. Chien, Nut Taesombut, Eric Weigle, Huaxia Xia and Justin Burke; April 28, 2004

o Sync-scan: A fast hand-off procedure for 802.11 link layer roaming. CS2004-0787, Ishwar Ramani and Stefan Savage; May 3, 2004
o Understanding When Location-Hiding Using Overlay Networks Is Feasible. CS2004-0788, Ju Wang and Andrew Chien; May 9, 2004
o Detecting Malicious Routers. CS2004-0789, Alper Mizrak, Keith Marzullo and Stefan Savage; May 24, 2004

o Semi-parametric exponential family PCA: Reducing dimensions via non-parametric latent distribution estimation. CS2004-0790, Sajama Sajama and Alon Orlitsky; June 2, 2004

o Fulcrum - An Open-Implementation Approach to Context-Aware Publish / Subscribe. CS2004-0791, Robert T. Boyer and William G. Griswold; June 8, 2004
o MobiNet: A Scalable Emulation Infrastructure for Ad Hoc and Wireless Networks. CS2004-0792, Priya Mahadevan, Adolfo Rodriguez, David Becker and Amin Vahdat; June 14, 2004
o Unified Summaries for Internet traffic. CS2004-0793, Cristian Estan; June 15, 2004
o Snce Algorithms for Knapsack Problem. CS2004-0794, Leo Landa; June 15, 2004

o Network Telescopes: Technical Report. CS2004-0795, David Moore, Colleen Shannon, Geoffrey M. Voelker and Stefan Savage; July 7, 2004
o Computing the Optimal Makespan for Jobs with Identical and Independent Tasks Scheduled on Volatile Hosts. CS2004-0796, Derrick Kondo and Henri Casanova; July 12, 2004

o Using Program Phases as Meta-Data for Runtime Energy Optimization. CS2004-0797, Cristiano Pereira and Rajesh Gupta; July 14, 2004

o Evaluation of a High Performance Erasure Code Implementation. CS2004-0798, Frank Uyeda, Huaxia Xia and Andrew A. Chien; September 13, 2004

o A New Direction in Tree Based Search Engine Architectures Using Balanced Single Port Memories. CS2004-0799, Florin Baboescu and Dean Tullsen; October
o Declarative Resource Naming for Macroprogramming Wireless Networks of Embedded Systems. CS2004-0800, Chalermek Intanagonwiwat, Rajesh Gupta and Amin Vahdat; November 2, 2004

o A Placement Methodology for Global Interconnect Reduction and Its Impact on Performance. CS2004-0801, Andrew Kahng, Igor Markov and Sherief Reda; October 31, 2004

o APST-DV: A Practical Framework for Scheduling and Deploying Divisible Loads on Grid Platforms. CS2004-0802, Krijn van der Raadt, Yang Yang and Henri Casanova; November 9, 2004

o Efficient Sampling Startup for Uniprocessor and Simultaneous Multithreading Simulation. CS2004-0803, Michael Van Biesbrouck, Lieven Eeckhout and Brad Calder; November 28, 2004

o Selecting Software Phase Markers with Code Structure Analysis. CS2004-0804, Jeremy Lau, Erez Perelman and Brad Calder; November 28, 2004

o Efficient Bounds Checking for C. CS2004-0805, Weihaw Chuang, Satish Narayanasamy and Brad Calder; November 28, 2004

o CLIDE: Interactively Formulating Feasible Queries on Query Rewriting-Based Systems. CS2004-0807, Michalis Petropoulos, Alin Deutsch and Yannis Papakonstantinou; December 12, 2004
o Critical-Path Aware Processor Architectures. CS2004-0808, Eric Tune; December 16, 2004

o Efficient Resource Description and High Quality Selection for Virtual Grids. CS2004-0809, Yang-Suk Kee, Dionysios Logothetis, Richard Huang, Henri



o Limit results on pattern entropy. CS2004-0811, Alon Orlitsky, Narayana Santhanam, Krishnamurthy Viswanathan and Junan Zhang; December 27, 2004

■ 2005

o Weak leader election for receive-omission process failures. CS2005-0812, Flavio Junqueira and Keith Marzullo; January 26, 2005
o A Systems Architecture for Ubiquitous Video. CS2005-0813, Neil J. McCurdy and William G. Griswold; February 4, 2005

o Harnessing Mobile Ubiquitous Video. CS2005-0814, Neil J. McCurdy and William G. Griswold; February 4, 2005

o Coping with Internet catastrophes. CS2005-0816, Flavio Junqueira, Ranjita Bhagwan, Alejandro Hevia, Keith Marzullo and Geoffrey M. Voelker; February 17, 2005

o The Virtual Grid Description Language: vgDL. CS2005-0817, Andrew Chien, Henri Casanova, Yang-suk Kee and Richard Huang; February 18, 2005

o NP-Completeness of the Divisible Load Scheduling Problem on Heterogeneous Star Platforms with Affine Costs. CS2005-0818, Arnaud Legrand, Yang Yang and Henri Casanova; March 10, 2005

o Accuracy Bounds For The Scaled Bitmap Data Structure. CS2005-0819, Sumeet Singh, Cristian Estan, George Varghese and Stefan Savage; March 22, 2005

o Place-Its: Location-Based Reminders on Mobile Phones. CS2005-0820, Timothy Sohn, Kevin Li, Gunny Lee, Ian Smith, James Scott and William Griswold; March 23, 2005

o Automatic Color Calibration for Large Camera Arrays. CS2005-0821, Neel Joshi, Bennett Wilburn, Vaibhav Vaish, Marc Levoy and Mark Horowitz; May 11, 2005

o The Power of Slicing in Internet Flow Measurement. CS2005-0822, Ramana Rao Kompella and Cristian Estan; May 13, 2005

o Enhanced Design Flow and Optimizations for Multi-Project Wafers. CS2005-0823, Andrew Kahng, Ion Mandoiu, Xu Xu and Alex Zelikovsky; May 14, 2005
o The Overlay Network Content Distribution Problem. CS2005-0824, Chip Killian, Michael Vrable, Alex C. Snoeren, Amin Vahdat and Joseph Pasquale; May 18, 2005

o Combined Selection and Binding for Competitive Resource Environments. CS2005-0825, Yang-Suk Kee, Henri Casanova and Andrew A. Chien; May 18, 2005
o Evaluating Location Based Reminders. CS2005-0826, Kevin A. Li, Timothy Sohn and William G. Griswold; May 18, 2005

o Declarative Resource Naming for Macroprogramming Wireless Networks of Embedded Systems. CS2005-0827, Chalermek Intanagonwiwat, Rajesh Gupta and Amin Vahdat; May 30, 2005

o Weak Leader Election in the receive-omission failure model. CS2005-0829. Flavio Junqueira and Keith Marzullo; June 1, 2005 
o Efficient Cooperative Scheduling in 802.11 Wireless Networks. CS2005-0830, Ishwar Ramani, Ramana Rao Kompella, Sriam Rai
July 7, 2005

o Coterie availability in sites (extended version). CS2005-0831, Flavio Junqueira and Keith Marzullo; July 27, 2005
o A Scalable Capstone Course for Academic Preparation. CS2005-0832, William G. Griswold; August 28, 2005
o Recognizing Cars. CS2005-0833, Louka Dlagnekov and Serge Belongie; September 28, 2005
o Maximum Instantaneous Power Estimation by Subgraph Coloring. CS2005-0834, Bao Liu; October 12, 2005

o Feedthrough Channel Effect on Wirelength Distribution in the Presence of Obstacles. CS2005-0835, Andrew Kahng, Chung-Kuan Cheng, Bao Liu and Dirk Stroobandt; October 12, 2005

o Charge-Matching Tail Approximation in a Piece-wise Linear-and-Exponential Function. CS2005-0836, Bao Liu; October 12, 2005

o NP-Completeness and Approximation Scheme of Zero-Skew Clock Tree Problem. CS2005-0837, Bao Liu; October 13, 2005

o Software Profiling for Deterministic Replay Debugging of User Code. CS2005-0839, Satish Narayanasamy and Brad Calder; October 18, 2005

o Automatic Logging of Operating System Effects to Simplify Application-Level Architecture Simulation. CS2005-0840, Satish Narayanasamy, Cristiano Pereira, Harish Patil, Robert Cohn and Brad Calder; October 18, 2005
o Comparing Multinomial and K-Means Clustering for SimPoint. CS2005-0841, Greg Hamerly, Erez Perelman and Brad Calder; October 20, 2005
o Peer-to-Peer Error Recovery for Hybrid Satellite-Terrestrial Networks. CS2005-0842, Eric Weigle, Matti Hiltunen, Rick Schlichting, Vinay Vaishampayan and Andrew A. Chien; October 31, 2005

o Efficient Hardware Support for Deterministic Replay Debugging of Memory Races, Interrupts and Self Modifying Code. CS2005-0843, Satish Narayanasamy, Cristiano Pereira and Brad Calder; November 14, 2005

o Detecting Phases in Parallel Applications on Shared Memory Architectures. CS2005-0844, Erez Perelman, Marzia Polito, Jean-Yves Bouguet, John Sampson,



o XML Queries Using Nested Views. CS2005-0846, Emiran Curtmola, Alin Deutsch, Nicola Onose and Yannis Fapakon



y, Ayse Coskun and Brad Calder; December 18, 2005



Michael Van Biesbrouck, Ganesh Venkatesh, Osvaldo Colavin and Brad Calder; December 18, 2005
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sumentation b' 
• Agrawal, A.

o Bucking Free-Riders: Distributed Accounting and Settlement in Peer-to-Peer Networks. CS2003-0751, June 24, 2003

o Distributed Application Management Using Plush. CS2006-0864, July 31, 2006

• Alvisi, L.

o Scalable Causal Message Logging for Wide-Area Networks. CS2000-0651, April 21, 2000

o Scalable Causal Message Logging for Wide-Area Environments. CS2001-0671, May 24, 2001

• Amer-Yahia, S. 

o Flexible and Efficient XML Search with Complex Full-Text Predicates. CS2005-0845, December 12, 2005

• Andrew A.

o OptIPuter System Software Framework. CS2004-0786, April 28, 2004

• Atkinson, D.

o Implementation Techniques for Efficient Data-Flow Analysis of Large Programs. CS2001-0665, February 3, 2001

o A Comparative Study of Two Whole Program Slicers for C. CS2001-0668, April 12, 2001

• Austin, T.

o A Power Efficient Speculative Fetch Architecture. CS2000-0657, June 28, 2000

o Relieving Register File and Instruction Window Pressure. CS2001-0689, November 20, 2001

• Baboescu, F.

o Aggregated Bit Vector Search Algorithms for Packet Filter Lookups. CS2001-0673, June 3, 2001
o Proxy Caching with Hash Functions. CS2001-0674, June 3, 2001

o Fast and Scalable Conflict Detection for Packet Classifiers. CS2002-0718, August 7, 2002
o Packet Classification for Core Routers: Is there an alternative to CAMs?. CS2002-0719, August 7, 2002
o HYPERCUTS: A Decision Tree Based Algorithm for Fast Packet Classification. CS2002-0730, December 12, 2002
o Packet Classification Using Multidimensional Cutting. CS2003-0736, February 7, 2003

o Increasing Object Visibility In Decentralized Unstructured Peer-To-Peer Networks Using Content Based Routing. CS2003-0740, March 28, 2003
o The case for ISP deployment of super-peers in P2P networks. CS2003-0742, April 15, 2003

o A New Direction in Tree Based Search Engine Architectures Using Balanced Single Port Memories. CS2004-0799, October 15, 2004

• Balachandran, A.

o Online Load Balancing and First-Hop Bandwidth Allocation in Public-Area Wireless Networks. CS2003-0748, June 10, 2003

• Bartol, T.

o The Virtual Instrument: Support for Grid-enabled Scientific Simulations. CS2002-0707, May 31, 2002

• Becker, D.

o MobiNet: A Scalable Emulation Infrastructure for Ad Hoc and Wireless Networks. CS2004-0792, June 14, 2004

• Belew, R.

o packing topical hierarchies: A comparison of two algorithms for reconciling keyword structures. CS2001-0669, April 26, 2001
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* ABSTRACT 
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the complete execution of the program). This realization has ramifications for many architectural and 
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end we examine the use of Basic Block Vectors, We quantify the effectiveness of Basic Block Vectors 
in capturing program behavior across several different architectural metrics, explore the large scale 
behavior of several programs, and develop a set of algorithms based on clustering capable of 
analyzing this behavior. We then demonstrate an application of this technology to automatically 
determine where to simulate for a program to help guide computer architecture research. 
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Abstract 

Understanding program behavior is at the foundation of 
computer architecture and program optimization. Many programs
have wildly different behavior on even the very largest
of scales (over the complete execution of the program). This
realization has ramifications for many architectural and compiler
techniques, from thread scheduling, to feedback directed
optimizations, to the way programs are simulated. However,
in order to take advantage of time-varying behavior, we must
first develop the analytical tools necessary to automatically 
and efficiently analyze program behavior over large sections 
of execution. 

Our goal is to develop automatic techniques that are ca- 
pable of finding and exploiting the Large Scale Behavior of 
programs (behavior seen over billions of instructions). The 
first step towards this goal is the development of a hardware 
independent metric that can concisely summarize the behavior
of an arbitrary section of execution in a program. To
this end we examine the use of Basic Block Vectors. We
quantify the effectiveness of Basic Block Vectors in capturing
program behavior across several different architectural metrics,
explore the large scale behavior of several programs, and
develop a set of algorithms based on clustering capable of analyzing
this behavior. We then demonstrate an application of
this technology to automatically determine where to simulate 
for a program to help guide computer architecture research. 

1. INTRODUCTION 

Programs can have wildly different behavior over their run 
time, and these behaviors can be seen even on the largest of 
scales. Understanding these large scale program behaviors
can unlock many new optimizations. These range from new
thread scheduling algorithms that make use of information on
when a thread's behavior changes, to feedback directed op- 
timizations targeted at not only the aggregate performance 
of the code but individual phases of execution, to creating 
simulations that accurately model full program behavior. To 
enable these optimizations, we must first develop the analyt- 
ical tools necessary to automatically and efficiently analyze 
program behavior over large sections of execution.

In order to perform such an analysis we need to develop a 
hardware independent metric that can concisely summarize 
the behavior of an arbitrary section of execution in a program.
In [19], we presented the use of Basic Block Vectors
(BBV), which uses the structure of the program that is ex- 
ercised during execution to determine where to simulate. A 
BBV represents the code blocks executed during a given in- 
terval of execution. Our goal was to find a single continuous 
window of executed instructions that match the whole pro- 
gram's execution, so that this smaller window of execution 
can be used for simulation instead of executing the program 
to completion. Using the BBVs provided us with a hardware 
independent way of finding this small representative window. 

In this paper we examine the use of BBVs for analyzing 
large scale program behavior. We use BBVs to explore the 
large scale behavior of several programs and discover the 
ways in which common patterns, and code, repeat themselves 
over the course of execution. We quantify the effectiveness of 
basic block vectors in capturing this program behavior across 
several different architectural metrics (such as IPC, branch, 
and cache miss rates). 

In addition to this, there is a need for a way of classifying 
these repeating patterns so that this information can be used 
for optimization. We show that this problem of classifying 
sections of execution is related to the problem of cluster- 
ing from machine learning, and we develop an algorithm to 
quickly and effectively find these sections based on clustering. 
Our techniques automatically break the full execution of the 
program up into several sets, where the elements of each set 
are very similar. Once this classification is completed, anal- 
ysis and optimization can be performed on a per-set basis. 

We demonstrate an application of this cluster-based be- 
havior analysis to simulation methodology for computer ar- 
chitecture research. By making use of clustering information 
we are able to accurately capture the behavior of a whole 
program by taking simulation results from representatives of 
each cluster and weighing them appropriately. This results 
in finding a set of simulation points that when combined ac- 
curately represents the target application and input, which 
in turn allows the behavior of even very complicated programs
such as gcc to be captured with a small amount of
simulation time. We provide simulation points (points in the 
program to start execution at) for Alpha binaries of all of the 
SPEC 2000 programs. In addition, we validate these simula- 
tion points with the IPC, branch, and cache miss rates found 
for complete execution of the SPEC 2000 programs. 
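
The weighting of cluster representatives described above can be illustrated with a minimal sketch. It assumes, as one plausible reading of "weighing them appropriately", that each representative's metric is weighted by the number of intervals in its cluster. The function name `weighted_estimate` and the numbers are hypothetical and are not taken from the paper.

```python
def weighted_estimate(cluster_sizes, representative_values):
    """Estimate a whole-program metric from one representative per cluster.

    cluster_sizes: number of intervals assigned to each cluster.
    representative_values: metric (e.g. IPC) measured on each cluster's
        representative interval, in the same order.
    Assumption: weights are proportional to cluster size.
    """
    if len(cluster_sizes) != len(representative_values):
        raise ValueError("one representative value is needed per cluster")
    total = sum(cluster_sizes)
    if total == 0:
        raise ValueError("clusters must contain at least one interval")
    weighted_sum = sum(
        size * value for size, value in zip(cluster_sizes, representative_values)
    )
    return weighted_sum / total


# Made-up example: three clusters covering 60, 30 and 10 intervals.
estimated_ipc = weighted_estimate([60, 30, 10], [1.0, 2.0, 0.5])
```

With these made-up inputs the estimate is dominated by the largest cluster, which is the intended effect of weighting by cluster size.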

The rest of the paper is laid out as follows. First, a summary
of the methodology used in this research is described
in Section 2. Section 3 presents a brief review of basic block
vectors and an in depth look into the proposed techniques 
and algorithms for identifying large scale program behaviors, 
and an analysis of their use on several programs. Section 4 
describes how clustering can be used to analyze program behavior,
and describes the clustering methods used in detail.
Section 5 examines the use of the techniques presented in
Sections 3 and 4 on an example problem: finding where to
simulate in a program to achieve results representative of full 
program behavior. Related work is discussed in Section 6, 
and the techniques presented are summarized in Section 7. 

2. METHODOLOGY 

In this paper we used both ATOM [21] and SimpleScalar 
3.0c [3] to perform our analysis and gather our results for 
the Alpha AXP ISA. ATOM is used to quickly gather pro- 
filing information about the code executed for a program. 
SimpleScalar is used to validate the phase behavior we found
when clustering our basic block profiles, showing that this
corresponds to the phase behavior in the program's performance
and architecture metrics. The baseline microarchitecture
model we simulated is detailed in Table 1. We simulate
an aggressive 8-way dynamically scheduled microprocessor
with a two level cache design. Simulation is execution-driven,
including execution down any speculative path until the de- 
tection of a fault, TLB miss, or branch mis-prediction. 

We analyzed and simulated all of the SPEC 2000 benchmarks
compiled for the Alpha ISA. The binaries we used
in this study and how they were compiled can be found at:
http://www.simplescalar.com/.

3. USING BASIC BLOCK VECTORS 

A basic block is a section of code that is executed from 
start to finish with one entry and one exit. We use the fre- 
quencies with which basic blocks are executed as the metric 
to compare different sections of the application's execution 
to one another. The intuition behind this is that the behavior
of the program at a given time is directly related to
the code it is executing during that interval, and basic block 
distributions provide us with this information. 

A program, when run for any interval of time, will execute 
each basic block a certain number of times. Knowing this
information provides us with a fingerprint for that interval 
of execution, and tells us where in the code the application 
is spending its time. The basic idea is that knowing the ba- 
sic block distribution for two different intervals gives us two 
separate fingerprints which we can then compare to find out 
how similar the intervals are to one another. If the fingerprints
are similar, then the two intervals spend about the
same amount of time in the same code, and the performance 
of those two intervals should be similar. 

3.1 Basic Block Vector 

A Basic Block Vector (BBV) is a single dimensional array, 
where there is a single element in the array for each static 
basic block in the program. For the results in this paper, the 
basic block vectors are collected in intervals of 100 million 
instructions throughout the execution of a program. At the 
end of each interval, the number of times each basic block is 
entered during the interval is recorded and a new count for 
each basic block begins for the next interval of 100 million instructions.
Therefore, each element in the array is the count
of how many times the corresponding basic block has been
entered during an interval of execution, multiplied by the
number of instructions in that basic block. By multiplying
in the number of instructions in each basic block we ensure
that we weigh instructions the same regardless of whether
they reside in a large or small basic block. We say that a Basic
Block Vector which was gathered by counting basic block
executions over an interval of N x 100 million instructions
is a Basic Block Vector of duration N.

Because we are not interested in the actual count of basic 
block executions for a given interval, but rather the proportions
between time spent in basic blocks, a BBV is normalized
by having each element divided by the sum of all the
elements in the vector.
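
As an illustration of the construction described in this subsection, the following minimal sketch builds a normalized BBV for one interval. The function name `build_bbv`, the dictionary representation, and the toy trace are assumptions made for the example; the paper's own profiles were gathered with ATOM.

```python
from collections import Counter


def build_bbv(block_trace, block_sizes):
    """Build a normalized Basic Block Vector for one interval of execution.

    block_trace: basic block ids in the order they were entered
        during the interval (hypothetical input format).
    block_sizes: dict mapping every static basic block id to the
        number of instructions it contains.
    Blocks that appear in the trace but not in block_sizes are ignored.
    """
    entries = Counter(block_trace)
    # Weight each entry count by the block's instruction count.
    weighted = {
        block: entries.get(block, 0) * size
        for block, size in block_sizes.items()
    }
    total = sum(weighted.values())
    if total == 0:
        return {block: 0.0 for block in block_sizes}
    # Normalize so that the elements sum to 1.
    return {block: value / total for block, value in weighted.items()}


# Toy interval: block A (4 instructions) entered 3 times,
# B (10 instructions) once, C (6 instructions) once.
bbv = build_bbv(["A", "B", "A", "C", "A"], {"A": 4, "B": 10, "C": 6})
```

In this toy interval block A accounts for 12 of the 28 executed instructions, so its element is 12/28 even though it is the smallest block.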

3.2 Basic Block Vector Difference 

In order to find patterns in the program we must first have
some way of comparing two Basic Block Vectors. The operation
we desire takes as input two Basic Block Vectors, and
outputs a single number which tells us how close they are to 
each other. There are several ways of comparing two vectors 
to one another, such as taking the dot product or finding 
the Euclidean or Manhattan distance. In this paper we use 
both the Euclidean and Manhattan distances for comparing 
vectors. 

The Euclidean distance can be found by treating each vector
as a single point in D-dimensional space. The distance
between two points is simply the square root of the sum of
squares, just as in $c^2 = a^2 + b^2$. The formula for computing the
Euclidean distance of two vectors a and b in D-dimensional
space is given by:

$$\mathrm{EuclideanDist}(a, b) = \sqrt{\sum_{i=1}^{D} (a_i - b_i)^2}$$

The Manhattan distance on the other hand is the distance 
between two points if the only paths you can take are parallel 
to the axes. In two dimensions this is analogous to the dis- 
tance traveled if you were to go by car through city blocks. 
This has the advantage that it weighs more heavily differences
in each dimension (being closer in the x-dimension does
not get you any closer in the y-dimension). The Manhattan
distance is computed by summing the absolute value of the
element-wise subtraction of two vectors. For vectors a and b
in D-dimensional space, the distance can be computed as:

$$\mathrm{ManhattanDist}(a, b) = \sum_{i=1}^{D} |a_i - b_i|$$

Because we have normalized all of the vectors, the Manhat- 
tan distance will always be a single number between 0 and 2 
(because we normalize each BBV to sum to 1). This number 
can then be used to compare how closely related two intervals 
of execution are to one another. For the rest of this section 
we will be discussing distances in terms of Manhattan distance,
because we found that it more accurately represented
differences in our high-dimensional data. We present the Eu- 
clidean distance as it pertains to the clustering algorithms 
presented in Section 4, since it provides a more accurate rep- 
resentation for data with lower dimensions. 
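
The two distance measures in this subsection can be sketched as follows. The function names and the list representation of a BBV are assumptions made for the example; the example vectors are made up and already normalized to sum to 1.

```python
import math


def euclidean_dist(a, b):
    """Square root of the sum of squared element-wise differences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_dist(a, b):
    """Sum of the absolute element-wise differences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(abs(x - y) for x, y in zip(a, b))


# Two normalized BBVs that execute completely different code,
# and one that splits its time between two blocks.
only_first = [1.0, 0.0, 0.0]
only_last = [0.0, 0.0, 1.0]
mixed = [0.5, 0.5, 0.0]

farthest = manhattan_dist(only_first, only_last)  # upper end of the 0..2 range
identical = manhattan_dist(mixed, mixed)          # lower end of the 0..2 range
```

Two normalized vectors with no code in common reach the maximum Manhattan distance of 2, and identical vectors give 0, which matches the range stated above.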

3.3 Basic Block Similarity Matrix 

Now that we have a method of comparing intervals of program
execution to one another, we can concentrate on
finding phase-based behavior. A phase of program behavior
can be defined in several ways. Past definitions are built
around the idea of a phase being a contiguous interval of execution
during which a measured program metric is relatively
stable. We extend this notion of a phase to include all similar
sections of execution regardless of temporal adjacency.

Instruction Cache             8k 2-way set-associative, 32 byte blocks, 1 cycle latency
Data Cache                    16k 4-way set-associative, 32 byte blocks, 2 cycle latency
Unified L2 Cache              1Meg 4-way set-associative, 32 byte blocks, 20 cycle latency
Memory                        150 cycle round trip access
Branch Predictor              hybrid - 8-bit gshare w/ 8k 2-bit predictors + an 8k bimodal predictor
Out-of-Order Issue Mechanism  out-of-order issue of up to 8 operations per cycle, 128 entry re-order buffer,
                              load/store queue, loads may execute when all prior store addresses are known
Architecture Registers        32 integer, 32 floating point
Functional Units              8-integer ALU, 4-load/store units, 2-FP adders, 2-integer MULT/DIV, 2-FP MULT/DIV
Virtual Memory                8K byte pages, 30 cycle fixed TLB miss latency after earlier-issued instructions complete

Table 1: Baseline Simulation Model.

A key observation from this paper is that the phase be- 
havior seen in any program metric is directly a function of 
the code being executed. Because of this we can use the 
comparison between the Basic Block Vectors as an approximate
bound on how closely related any other metrics will be.

To find how intervals of execution relate to one another we
create a Basic Block Similarity Matrix. The similarity matrix
is an upper triangular NxN matrix, where N is the number 
of intervals in the program's execution. An entry at (x, y) in 
the matrix, represents the Manhattan distance between the 
basic block vector at interval x and the basic block vector at 
interval y. 
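
A minimal sketch of building such a matrix is given below. It assumes each BBV is a plain list of normalized values; the function names are hypothetical, and entries below the diagonal are left as `None` to reflect the upper triangular layout.

```python
def manhattan_dist(a, b):
    """Sum of the absolute element-wise differences of two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def similarity_matrix(bbvs):
    """Upper triangular N x N matrix of Manhattan distances.

    bbvs: list of N normalized basic block vectors, one per interval.
    Entry [x][y] with y >= x holds the distance between interval x
    and interval y; entries below the diagonal are None.
    """
    n = len(bbvs)
    matrix = [[None] * n for _ in range(n)]
    for x in range(n):
        for y in range(x, n):
            matrix[x][y] = manhattan_dist(bbvs[x], bbvs[y])
    return matrix


# Made-up execution with three intervals: the first and third
# run the same code, the second runs different code.
intervals = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
matrix = similarity_matrix(intervals)
```

Here intervals 0 and 2 are not adjacent in time but have distance 0, which is the kind of non-contiguous similarity the matrix is meant to expose.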

Figures 1 (left and right) and 4 (left) show the similarity
matrices for gzip, bzip, and gcc using the Manhattan distance.
The diagonal of the matrix represents the program's
execution over time from start to completion. The darker 
the points, the more similar the intervals are (the Manhat- 
tan distance is closer to 0), and the lighter they are the more 
different they are (the Manhattan distance is closer to 2). 

The top left corner of each graph is the start of program 
execution and is the origin of the graph, (0, 0), and the bot- 
tom right of the graph is the point (N — 1, N — 1) where N 
is the number of intervals that the full program execution 
was divided up into. The way to interpret the graph is to 
start considering points along the diagonal axis drawn. Each
point is perfectly similar to itself, so the points directly on
the axis all are drawn dark. Starting from a given point on
the diagonal axis of the graph, you can begin to compare how
that point relates to its neighbors forward and backward in
execution by tracing horizontally or vertically. If you wish
to compare a given interval x with the interval at x + n,
you simply start at the point (x, x) on the graph and trace
horizontally to the right until you reach (x, x + n).

To examine the phase behavior of programs, let us first 
examine gzip because it has behavior on such a large scale 
that it is easy to see. If we examine an interval taken from 70 
billion instructions into execution, in Figure 1 (left), this is 
directly in the middle of a large phase shown by the triangle 
block of dark color that surrounds this point. This means 
that this interval is very similar to its neighbors both forward
and backward in time. We can also see that the execution at 
50 billion and 90 billion instructions is also very similar to the 
program behavior at 70 billion. We also note, while it may
be hard to see in a printed version, that the phase interval at
70 billion instructions is similar to the phases at 10
and 30 billion, but not as similar as to those around
50 and 90 billion. Compare this with the IPC and data cache
miss rates for gzip shown in Figure 2. Overall, Figure 1 (left)
shows that the phase behavior seen in the similarity matrix
lines up quite closely with the behavior of the program, with
5 large phases (the first 2 being different from the last 3) each
divided by a small phase, where all of the small phases are
very similar to each other.

The similarity matrix for bzip (shown on the right of Fig- 
ure 1) is very interesting. Bzip has complicated behavior, 
with two large parts to its execution, compression and decompression.
This can readily be seen in the figure as the
large dark triangular and square patches. The interesting 
thing about bzip is that even within each of these sections of 
execution there is complex behavior. This, as will be shown 
later, makes the behavior of bzip impossible to capture using 
a small contiguous section of execution. 

A more complex case for finding phase behavior is gcc,
which is shown on the left of Figure 4. This similarity matrix
shows the results for gcc using the Manhattan distance.
The similarity matrix on the right will be explained in more
detail in Section 4.2.1. This figure shows that gcc does have
some regular behavior. It shows that, even here, there is common
code shared between sections of execution, such as the
intervals around 13 billion and 36 billion. In fact the strong
dark diagonal line cutting through the matrix indicates that
there is a good amount of repetition between offset segments of
execution. By analyzing the graph we can see that interval
x is very similar to interval (x + 23.6B) for all x.
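
One way to locate such a dark off-diagonal line numerically is to average the distance along each off-diagonal and take the offset with the smallest mean. The sketch below is an illustration only: the helper names are hypothetical, the vectors are assumed to be already normalized, and the toy data simply repeats with period 3.

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def mean_distance_at_offset(vectors, offset):
    """Average distance between interval x and interval x + offset."""
    pairs = [(x, x + offset) for x in range(len(vectors) - offset)]
    return sum(manhattan(vectors[x], vectors[y]) for x, y in pairs) / len(pairs)


def best_offset(vectors, min_offset=1):
    """Offset whose off-diagonal is darkest (smallest mean distance).

    Only offsets up to half the run length are tried, so every offset
    is averaged over a reasonable number of interval pairs.
    """
    candidates = range(min_offset, len(vectors) // 2 + 1)
    return min(candidates, key=lambda off: mean_distance_at_offset(vectors, off))


# Toy run of 12 intervals whose behavior repeats every 3 intervals.
pattern = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_run = pattern * 4
period = best_offset(toy_run)
```

On real data the minimum would rarely be exactly zero, so the mean distance at the chosen offset is worth inspecting alongside the offset itself.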

Figures 2 and 5 show the time varying behavior of gzip
and gcc. The average IPC and data cache miss rate are shown
for each 100 million instruction interval of execution over the complete
execution of the program. The time varying results graphically
show the same phase behavior seen by looking at only
the code executed. For example, the two phases for gcc at 13
billion and 36 billion, shown to be very similar in Figure 4,
are shown to have the same IPC and data cache miss rate in
Figure 5.

4. CLUSTERING 

The basic block vectors provide a compact and represen- 
tative summary of the program's behavior for intervals of 
execution. By examining the similarity between them, it is 
clear that there exists a high level pattern to each program's 
execution. In order to make use of this behavior we need 
to start by delineating a method of finding and representing
the information. Because there are so many intervals of
execution that are similar to one another, one efficient representation
is to group the intervals together that have similar
behavior. This problem is analogous to a clustering problem.
Later, in Section 5, we demonstrate how we use the clusters
we discover to find multiple simulation points for irregular
programs or inputs like gcc. By simulating only a single representative
from each cluster, we can accurately represent the
whole program's execution.

4.1 Clustering Overview 

The goal of clustering is to divide a set of points into groups

Figure 1: Basic block similarity matrix for the programs gzip-graphic (shown left) and bzip-graphic (shown
right). The diagonal of the matrix represents the program's execution to completion with units in billions of
instructions. The darker the points, the more similar the intervals are (the Manhattan distance is closer to
0), and the lighter the points the more different they are (the Manhattan distance is closer to 2).

Figure 2: (top graph) Time varying graph for gzip-graphic. The average IPC (drawn with solid line) and
L1 data cache miss rate (drawn with dotted line) are plotted for every interval (100 million instructions of
execution) showing how these metrics vary over the program's execution. The x-axis represents the execution
of the program over time, and the y-axis the percent of max value the metric had during execution. The
results are non-accumulative.

Figure 3: (bottom graph) Cluster graph for gzip-graphic. The full run of the execution is partitioned into a
set of 6 clusters. The x-axis is in instructions executed, and the graph shows for each interval of execution
(every 100 million instructions), which cluster the interval was placed into.

Figure 4: The original basic block similarity matrix for the program gcc (shown left), and the similarity matrix
for gcc-166 drawn from projected data (on right). The figure on the left uses the original basic block vectors
(each of which has over 100,000 dimensions) and uses the Manhattan distance as a method of difference
taking. The figure on the right uses projected data (down to 15 dimensions) and uses the Euclidean distance
for difference taking.

Figure 5: (top graph) Time varying graph for gcc-166. The average IPC (drawn with solid line) and L1 data
cache miss rate (drawn with dotted line) are plotted for every interval (100 million instructions of execution)
showing how these metrics vary over the program's execution. The x-axis represents the execution of the
program over time, and the y-axis the percent of max value the metric had during execution. The results are
non-accumulative.

Figure 6: (bottom graph) Cluster graph for gcc-166. The full run of the execution is partitioned into a set of 
4 clusters. The x-axis is in instructions executed, and the graph shows for each interval of execution (every 
100 million instructions), which cluster the interval was placed into. 



such that points within each group are similar to one another
(by some metric, often distance), and points in different
groups are different from one another. This problem arises in
other fields such as computer vision [10], document classification
[22], and genomics [1], and as such it is an area of much
active research. There are many clustering algorithms and
many approaches to clustering. Classically, the two primary
clustering approaches are Partitioning and Hierarchical:

Partitioning algorithms choose an initial solution and then 
use iterative updates to find a better solution. Popular algorithms
such as k-means [14] and Gaussian Expectation-Maximization
[2, pages 59-73] are in this family. These algorithms
tend to have run time that is linear in the size of
the dataset. 

Hierarchical algorithms [9] either combine similar
points (called agglomerative clustering, and conceptually
similar to Huffman encoding), or recursively divide the
dataset into more groups (called divisive clustering). These
algorithms tend to have run time that is quadratic in the size 
of the dataset. 

4.2 Phase Finding Algorithm 

For our algorithm, we use random linear projection followed
by k-means. We choose to use the k-means clustering
algorithm, since it is a very fast and simple algorithm that
yields good results. To choose the value of k, we use the
Bayesian Information Criterion (BIC) score [11, 17]. The
following steps summarize our algorithm, and then several of
the steps are explained in more detail:

1. Profile the basic blocks executed in each program to 
generate the basic block vectors for every 100 million 
instructions of execution. 

2. Reduce the dimension of the BBV data to 15 dimen- 
sions using random linear projection. 

3. Try the k-means clustering algorithm on the
low-dimensional data for k values 1 to 10. Each run of
k-means produces a clustering, which is a partition of
the data into k different clusters.

4. For each clustering (k = 1...10), score the fit of the
clustering using the BIC. Choose the clustering with
the smallest k, such that its score is at least 90% as
good as the best score.

4.2.1 Random Projection 

For this clustering problem, we have to address the prob- 
lem of dimensionality. All clustering algorithms suffer from 
the so-called "curse of dimensionality", which refers to the
fact that it becomes extremely hard to cluster data as the 
number of dimensions increases. For the basic block vectors, 
the number of dimensions is the number of executed basic 
blocks in the program, which ranges from 2,756 to 102,038 
for our experimental data, and could grow into the millions
for very large programs. Another practical problem is that 
the running time of our clustering algorithm depends on the 
dimension of the data, making it slow if the dimension grows 
too large. 

Two ways of reducing the dimension of data are dimension 
selection and dimension reduction. Dimension selection sim- 
ply removes all but a small number of the dimensions of the 
data, based on a measure of goodness of each dimension for 
describing the data. However, this throws away a lot of data 
in the dimensions which are ignored. Dimension reduction
reduces the number of dimensions by creating a new lower-dimensional
space and then projecting each data point into
the new space (where the new space's dimensions are not
directly related to the old space's dimensions). This is analogous
to taking a picture of 3 dimensional data at a random
angle and projecting it onto a screen of 2 dimensions.

For this work we choose to use random linear projection [5] 
to create a new low-dimensional space into which we project 
the data. This is a simple and fast technique that is very ef- 
fective at reducing the number of dimensions while retaining 
the properties of the data. There are two steps to reducing
a dataset X (which is a matrix of basic block vectors and is
of size $N_{intervals} \times D_{numbb}$, where $D_{numbb}$ is the number of
basic blocks in the program) down to $D_{new}$ dimensions using
random linear projection:

• Create a $D_{numbb} \times D_{new}$ projection matrix M by choosing
a random value for each matrix entry between -1
and 1.

• Multiply X times M to obtain the new lower-dimensional
dataset X' which will be of size $N_{intervals} \times D_{new}$.
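
The two steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the function names and the fixed seed are hypothetical, entries are drawn uniformly from [-1, 1], and a real implementation would likely use a numerical library rather than nested lists.

```python
import random


def random_projection_matrix(d_numbb, d_new, seed=0):
    """Build a d_numbb x d_new matrix with entries drawn uniformly from [-1, 1]."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return [[rng.uniform(-1.0, 1.0) for _ in range(d_new)] for _ in range(d_numbb)]


def project(X, M):
    """Multiply X (N x d_numbb) by M (d_numbb x d_new) to get X' (N x d_new)."""
    d_new = len(M[0])
    return [
        [sum(row[j] * M[j][c] for j in range(len(row))) for c in range(d_new)]
        for row in X
    ]


# Two toy intervals over six basic blocks, projected down to three dimensions.
toy_X = [[1, 0, 0, 0, 0, 0], [0.5, 0.5, 0, 0, 0, 0]]
toy_M = random_projection_matrix(d_numbb=6, d_new=3)
toy_X_projected = project(toy_X, toy_M)
```

Because the projection is linear, the same matrix M has to be reused for every interval of a program so that distances between projected intervals remain comparable.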

For clustering programs, we found that using $D_{new} = 15$
dimensions is sufficient to still differentiate the different
phases of execution. Figure 7 shows why we chose to project
the data down to 15 dimensions. The graph shows the number
of dimensions on the x-axis. The y-axis represents the k
value found to be best on average, when the programs were
projected down to the number of dimensions indicated by the
x-axis. The best k is determined by the k with the highest
BIC score, which is discussed in Section 4.2.3. The y-axis is
shown as a percent of the maximum k seen for each program 
so that the curve can be examined independent of the actual 
number of clusters found for each program. The results show 
that for 15 dimensions the number of clusters found begins 
to stabilize and only climbs slightly. Similar results were also 
found using a different method of finding k in [6]. 

The advantages of using linear projections are twofold. 
First, creating new vectors with a low dimension of 15 is 
extremely fast and can even be done at simulation time. Secondly,
using only 15 dimensions speeds up the k-means algorithm
significantly, and reduces the memory requirements
by several orders of magnitude over using the original basic 
block vectors. 

Figure 4 shows the similarity matrix for gcc on the left
using original BBVs, whereas the similarity matrix on the
right shows the same matrix but on the data that has been
projected down to 15 dimensions. For the reduced dimension 
data we use the Euclidean distance to measure differences, 
rather than the Manhattan distance used on the full data. 
After the projection, some information will be blurred, but 
overall the phases of execution that are very similar with full 
dimensions can still be seen to have a strong similarity with 
only 15 dimensions. 

4.2.2 K-means 

The k-means algorithm is an iterative optimization algorithm,
which executes as two phases, which are repeated to
convergence. The algorithm begins with a random assignment
of k different centers, and begins its iterative process.
The iterations are required because of the recursive nature 
of the algorithm; the cluster centers define the cluster mem- 
bership for each data point, but the data point memberships 
define the cluster centers. Each point in the data belongs to, 
and can be considered a member of, a single cluster.

Figure 7: Motivation for random projection down
to 15 dimensions (D=15). The x-axis is the number
of dimensions of the projection, and the y-axis is
the percent of the max number of clusters found for
each program averaged over all spec programs. The
results show that as you decrease the number of dimensions
too far (the lowest point is two dimensions)
the true clusters become collapsed on one another,
and the algorithm cannot find as many clusters. By
D=15 most of this effect has gone.

We initialize the k cluster centers by choosing k random
points from the data to be clustered. After initialization, the
k-means algorithm proceeds in two phases which are repeated
until convergence:

• For each data point being clustered, compare its distance
to each of the k cluster centers and assign it to
(make it a member of) the cluster to which it is the
closest.

• For each cluster center, change its position to the cen- 
troid of all of the points in its cluster (from the mem- 
berships just computed). The centroid is computed as 
the average of all the data points in the cluster. 

This process is iterated until membership (and hence clus- 
ter centers) cease to change between iterations. At this point 
the algorithm terminates, and the output is a set of final clus- 
ter centers and a mapping of each point to the cluster that 
it belongs to. Since we have projected the data down to 15
dimensions, we can quickly generate the clusters for k-means
with k from 1 to 10. There are efficient algorithms
for comparing the clusters that are formed for these
different values of k, and choosing one that is good but still
uses a small value for k is the next problem.
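
The two-phase loop described above can be sketched as follows. This is a minimal illustration rather than the implementation used for the results: the function names, the fixed seed, the iteration cap, and the handling of empty clusters (an empty cluster keeps its previous center) are all assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance; it preserves the nearest-center ordering."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Average of a non-empty list of points, dimension by dimension."""
    dims = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dims)]


def kmeans(points, k, seed=0, max_iters=100):
    """Return (centers, membership) once memberships stop changing."""
    rng = random.Random(seed)
    # Initialization: k random points from the data become the centers.
    centers = [list(p) for p in rng.sample(points, k)]
    membership = None
    for _ in range(max_iters):
        # Phase 1: assign each point to its closest center.
        new_membership = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_membership == membership:
            break  # memberships (and hence centers) no longer change
        membership = new_membership
        # Phase 2: move each center to the centroid of its members.
        for c in range(k):
            members = [p for p, m in zip(points, membership) if m == c]
            if members:
                centers[c] = centroid(members)
    return centers, membership


# Two well separated toy groups in two dimensions.
toy_points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
toy_centers, toy_membership = kmeans(toy_points, k=2)
```

Since the starting centers are random, different seeds can converge to different clusterings on harder data, which is one reason a scoring step such as the BIC is useful afterwards.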

4.2.3 Bayesian Information Criterion 

To compare and evaluate the different clusters formed for
different k, we use the Bayesian Information Criterion (BIC)
as a measure of the "goodness of fit" of a clustering to a
dataset. More formally, the BIC is an approximation to the
probability of the clustering given the data that has been
clustered. Thus, the larger the BIC score, the higher the
probability that the clustering being scored is a "good fit" to
the data being clustered. We use the BIC formulation given
in [17] for clustering with k-means; however, other formulations
of the BIC could also be used.



More formally, the BIC score is a penalized likelihood.
There are two terms in the BIC: the likelihood and the penalty.
The likelihood is a measure of how well the clustering models
the data. To get the likelihood, each cluster is considered to
be produced by a spherical Gaussian distribution, and the
likelihood of the data in a cluster is the product of the probabilities
of each point in the cluster given by the Gaussian.
The likelihood for the whole dataset is just the product of the
likelihoods for all clusters. However, the likelihood tends to
increase without bound as more clusters are added. Therefore
the second term is a penalty that offsets the likelihood
growth based on the number of clusters. The BIC is formulated
as

$$BIC(D, k) = l(D \mid k) - \frac{p_j}{2} \log(R)$$

where $l(D \mid k)$ is the likelihood, $R$ is the number of points in
the data, and $p_j$ is the number of parameters to estimate,
which is $(k - 1) + dk + 1$ for $(k - 1)$ cluster probabilities,
$k$ cluster center estimates which each require $d$ dimensions,
and 1 variance estimate. To compute $l(D \mid k)$ we use

[equation not legible in this copy]

where $R_i$ is the number of points in the ith cluster, and $\sigma^2$
is the average variance of the Euclidean distance from each
point to its cluster center.
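
For illustration, a BIC score of this shape can be sketched as below. The penalty term and the parameter count follow the definitions in the text. The log-likelihood expression and the variance estimator are assumptions: they are the spherical-Gaussian form commonly given for scoring k-means clusterings, and may differ in detail from the formulation in [17]. All function names are hypothetical.

```python
import math


def num_parameters(k, d):
    """(k - 1) cluster probabilities + k centers of d dimensions + 1 variance."""
    return (k - 1) + d * k + 1


def bic_score(points, centers, membership):
    """Penalized log-likelihood of a clustering (assumed spherical-Gaussian form)."""
    R = len(points)
    k = len(centers)
    d = len(points[0])
    if R <= k:
        raise ValueError("need more points than clusters to estimate the variance")
    squared_error = sum(
        sum((x - c) ** 2 for x, c in zip(p, centers[m]))
        for p, m in zip(points, membership)
    )
    variance = squared_error / (R - k)  # assumed estimator of sigma^2
    if variance == 0:
        raise ValueError("zero variance: the log-likelihood is unbounded")
    log_likelihood = 0.0
    for i in range(k):
        R_i = membership.count(i)
        if R_i == 0:
            continue  # an empty cluster contributes nothing
        log_likelihood += (
            -R_i / 2 * math.log(2 * math.pi)
            - R_i * d / 2 * math.log(variance)
            - (R_i - k) / 2
            + R_i * math.log(R_i / R)
        )
    penalty = num_parameters(k, d) / 2 * math.log(R)
    return log_likelihood - penalty


# Two tight toy groups: a two-cluster fit should score higher than one cluster.
toy_points = [[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]]
bic_two = bic_score(toy_points, [[0.5, 0.5], [10.5, 10.5]], [0, 0, 0, 0, 1, 1, 1, 1])
bic_one = bic_score(toy_points, [[5.5, 5.5]], [0] * 8)
```

The score is computed in log space, so the products of probabilities described in the text become sums, which avoids numerical underflow for large datasets.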

For a given program and inputs, the BIC score is calculated 
for each k-means clustering, for k from 1 to N. We then choose 
the clustering that achieves a BIC score that is at least 90% 
of the spread between the largest and smallest BIC score 
that the algorithm has seen. Figure 8 shows the benefit of 
choosing a BIC with a high value and its relationship with the 
variance in IPC seen for that cluster. The y-axis shows the 
percent of IPC variance seen for a given clustering, and the 
corresponding BIC score the clustering received. Each point 
on the graph represents the average or max IPC variance for 
all points in the range of ±5% of the BIC score shown. The 
results show that picking clusterings that represent greater 
than 80% of the BIC score resulted in an IPC variance of less 
than 20% on average. The IPC variance was computed as the 
weighted sum of the IPC variance for each cluster, where the 
weight for a cluster is the number of points in that cluster. 
The IPC variance for each cluster is simply the variance of 
the IPC for all the points in that cluster. 
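
The selection rule can be sketched as follows: compute the spread between the smallest and largest BIC score, and take the smallest k whose score reaches 90% of that spread. The function name and the toy scores are hypothetical; preferring the smallest qualifying k follows step 4 of the algorithm summary in Section 4.2.

```python
def choose_k(bic_by_k, fraction=0.9):
    """Smallest k whose BIC reaches `fraction` of the min-to-max spread."""
    lowest = min(bic_by_k.values())
    highest = max(bic_by_k.values())
    threshold = lowest + fraction * (highest - lowest)
    return min(k for k, score in bic_by_k.items() if score >= threshold)


# Hypothetical BIC scores for k = 1..6 (larger is better).
toy_scores = {1: -500.0, 2: -320.0, 3: -130.0, 4: -105.0, 5: -100.0, 6: -102.0}
chosen_k = choose_k(toy_scores)
```

With these toy scores the threshold is -140, so k = 3 is chosen even though k = 5 has the highest score: the extra clusters buy only a small improvement in fit.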

4.3 Clusters and Phase Behavior 

Figures 3 and 6 show the 6 clusters formed for gzip and 
the 4 clusters formed for gcc. The X-axis corresponds to 
the execution of the program in billions of instructions, and 
each interval (each of 100 million instructions) is tagged to 
be in one of the N clusters (labeled on the Y-axis). These 
figures, just as for Figures 1 and 4, show the execution of the 
programs to completion. 

For gzip, the full run of the execution is partitioned into 
a set of 6 clusters. Looking to Figure 1 (left) for comparison,
we see that the cluster behavior captured by our tool lines up
quite closely with the behavior of the program. The majority
of the points are contained by clusters 1, 2, 3, and 6. Clusters
1 and 2 represent the large sections of execution which are 
similar to one another. Clusters 3 and 6 capture the smaller 
phases which lie in between these large phases, while cluster 
5 contains a small subset of the larger phases, and cluster 4 
represents the initialization phase.

Figure 8: Plot of average IPC variance and max IPC
variance versus the BIC. These results indicate that
for our data, a clustering found to have a BIC score
greater than 80% will have, on average, an IPC variance
of less than 0.2.

In the cluster graph for gcc, shown in Figure 6, the run
is now partitioned into 4 different clusters. Looking to Fig- 
ure 4 for comparison, we see that even the more complicated 
behavior of gcc is captured correctly by our tool. Clusters 
2 and 4 correspond to the dark boxes shown parallel to the 
diagonal axis. It should also be noted that the projection 
does introduce some degree of error into the clustering. For 
example, the first group of points in cluster 2 are not really 
that similar to the other points in the cluster. Comparing
the two similarity matrices in Figure 4 shows the introduction
of a dark band at (0,30) on the graph which was not in
the original (un-projected) data. Despite these small errors,
the clustering is still very good, and the impact of any such 
errors will be minimized in the next section. 

5. FINDING SIMULATION POINTS 

Modern computer architecture research relies heavily on 
cycle accurate simulation to help evaluate new architectural 
features. While the performance of processors continues to 
grow exponentially, the amount of complexity within a processor
continues to grow at an even faster rate. With each
generation of processor more transistors are added, and more 
things are done in parallel on chip in a given cycle while at 
the same time cycle times continue to decrease. This grow- 
ing gap between speed and complexity means that the time 
to simulate a constant amount of processor time is growing. 
It is already to the point that executing programs fully to 
completion in a detailed simulator is no longer feasible for 
architectural studies. Since detailed simulation takes a great 
deal of processing power, only a small subset of a whole pro- 
gram can be simulated. 

SimpleScalar [3], one of the faster cycle-level simulators, 
can simulate around 400 million instructions per hour. Un- 
fortunately many of the new SPEC 2000 programs execute 
for 300 billion instructions or more. At 400 million instruc- 
tions per hour this will take approximately 1 month of CPU 
time. 
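
The estimate above is simple arithmetic, shown here as a small sketch (the function name is hypothetical, and the simulator is assumed to run continuously at the quoted rate):

```python
def simulation_days(total_instructions, instructions_per_hour):
    """Days of CPU time needed to simulate a run at a constant rate."""
    hours = total_instructions / instructions_per_hour
    return hours / 24


# 300 billion instructions at 400 million instructions per hour.
days = simulation_days(300_000_000_000, 400_000_000)
```

This works out to 750 hours, or a little over 31 days, which is where the figure of approximately one month comes from.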

Because it is only feasible to execute a small portion of 
the program, it is very important that the section simulated 
is an accurate representation of the program's behavior as a 



whole. The basic block vector and cluster analysis presented 
in Sections 3 and 4 will allow us to make sure that this is the case.



5.1 Single Simulation Points 

In [19], we used basic block vectors to automatically find a 
single simulation point to potentially represent the complete 
execution of a program. A Simulation Point is a starting 
simulation place (in number of instructions executed from the
start of execution) in a program's execution derived from our 
analysis. That algorithm creates a target basic block vector, 
which is a BBV that represents the complete execution of 
the program. The Manhattan distance between each interval 
BBV and the target BBV is computed. The BBV with the 
lowest Manhattan distance represents the single simulation 
point that executes the code closest to the complete execution 
of the program. This approach is used to calculate the long 
single simulation points (LongSP) described below. 
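
The earlier approach can be sketched as follows. This is an illustration only: the function names are hypothetical, and it assumes the target BBV is the normalized sum of all interval BBVs, which is one natural reading of "a BBV that represents the complete execution".

```python
def normalize(bbv):
    """Scale a basic block vector so its entries sum to 1."""
    total = sum(bbv)
    return [count / total for count in bbv]


def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def target_bbv(interval_bbvs):
    """BBV of the complete execution: per-block totals over all intervals."""
    dims = len(interval_bbvs[0])
    totals = [sum(bbv[i] for bbv in interval_bbvs) for i in range(dims)]
    return normalize(totals)


def single_simulation_point(interval_bbvs):
    """Index of the interval whose BBV is closest to the target BBV."""
    target = target_bbv(interval_bbvs)
    distances = [manhattan(normalize(bbv), target) for bbv in interval_bbvs]
    return min(range(len(distances)), key=distances.__getitem__)


# Four toy intervals over two basic blocks (hypothetical counts).
toy_intervals = [[100, 0], [50, 50], [0, 100], [60, 40]]
simulation_point = single_simulation_point(toy_intervals)
```

In this toy run the whole-program mix is 52.5% / 47.5%, so the interval with an even 50/50 split (index 1) is the closest match.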

In comparison, the single simulation point results in this 
paper are calculated by choosing the BBV that has the small- 
est Euclidean distance from the centroid of the whole dataset 
in the 15-dimensional space, a method which we find superior
to the original method. The 15-dimensional centroid is
formed by taking the average of each dimension over all in- 
tervals in the cluster. 
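
A minimal sketch of this centroid-based choice is shown below. The function names are hypothetical, and the toy points are two-dimensional stand-ins for the 15-dimensional projected vectors.

```python
import math


def centroid(points):
    """Average of each dimension over all points."""
    dims = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dims)]


def closest_to_centroid(points):
    """Index of the point with the smallest Euclidean distance to the centroid."""
    center = centroid(points)
    return min(range(len(points)), key=lambda i: math.dist(points[i], center))


# Toy projected intervals; their centroid is (2, 2).
toy_projected = [[0, 0], [4, 0], [0, 4], [1, 1], [5, 5]]
simulation_point = closest_to_centroid(toy_projected)
```

The centroid itself is generally not an interval that was actually executed, which is why the nearest real interval is selected instead.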

Figure 9 shows the IPC estimated by executing only a 
single interval, all 100 million instructions long but chosen 
by different methods, for all SPEC 2000 programs. This
is shown in comparison to the IPC found by executing the 
program to completion. The results are from SimpleScalar 
using the architecture model described in Section 2, and all 
fast forwarding is done so that all of the architecture struc- 
tures are completely warmed up when starting simulation (no 
cold-start effect). 

The first bar, labeled none, is the IPC found when exe- 
cuting only the first 100 million instructions from the start 
of execution (without any fast forwarding). The second bar,
FF-Billion, shows the results after blindly fast forwarding 1
billion instructions before starting simulation. The third bar,
SimPoint, shows the IPC using our single simulation point
analysis described above, and the last bar shows the IPC of 
simulating the program to completion (labeled Full). Be- 
cause these are actual IPC values, values which are closer to 
the Full bar are better. 

The results in Figure 9 show that the single simulation
points are very close to the actual full execution of the program,
especially when compared against the ad-hoc techniques.
Starting simulation at the start of the program results
in an average error of 210%, whereas blindly fast forwarding
results in an average 80% IPC error. Using our single
simulation point analysis we reduce the average IPC error to
18%. These results show that it is possible to reasonably capture
the behavior of most programs using a very small
slice of execution.

Table 2 shows the actual simulation points chosen along 
with the program counter (PC) and procedure name corre- 
sponding to the start of the interval. If an input is not at- 
tached to the program name, then the default ref input was 
used. Columns 2 through 4 are in terms of the number of 
intervals (each 100 million instructions long). The first column
is the number of instructions executed by the program,
on the specific input, when run to completion. The second 
column shows the end of initialization phase calculated as de- 
scribed in [19]. The third column shows the single simulation 
point automatically chosen as described above. This simulation
point is used to fast forward to the point of desired
execution. Some simulators, debuggers, or tracing environments
(e.g., gdb) provide the ability to fast forward based
upon a program PC, and the number of times that PC was
executed. We therefore provide the instruction PC for the
start of the simulation point, the procedure that PC occurred
in, and the number of times that PC has to be executed in
order to arrive at the desired simulation point.

Figure 9: Simulation results starting simulation at the start of the program (none), blindly fast forwarding 1
billion instructions, using a single simulation point, and the IPC of the full execution of the program.

These results show that a single simulation point can be 
accurate for many programs, but there is still a significant 
amount of error for programs like bzip, gzip and gcc. This 
occurs because there are many different phases of execution 
in these programs, and a single simulation point will not ac- 
curately represent all of the different phases. To address this, 
we used our clustering analysis to find multiple simulation 
points to accurately capture these programs' behavior, which
we describe next. 



5.2 Multiple Simulation Points 

To support multiple simulation points, the simulator can 
be run from start to stop, only performing detailed simu- 
lation on the selected intervals. Or the simulation can be 
broken down into N simulations, where N is the number of 
clusters found via analysis, and each simulation is run sepa- 
rately. This has the further benefit of breaking the simulation 
down into parallel components that can be distributed across 
many processors. This is the methodology we use in our simulator.
For both cases, results from the separate simulation
points need to be weighted and combined to arrive at overall
performance for the program [4]. Care must be taken to com- 
bine statistics correctly (simply averaging will give incorrect 
results for statistics such as rates). 
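
To see why plain averaging fails for rates, consider IPC. The sketch below is an illustration under stated assumptions: every simulation point covers the same number of instructions, each weight is the fraction of the program's instructions that the point represents, and the function names are hypothetical. Under those assumptions total cycles are a weighted sum of CPI, so the combined IPC is a weighted harmonic mean rather than a weighted arithmetic mean.

```python
def combined_ipc(samples):
    """Overall IPC from (weight, ipc) pairs covering equal instruction counts.

    Cycles add up across intervals, so CPI (cycles per instruction) is
    averaged by weight and then inverted to give IPC.
    """
    total_weight = sum(weight for weight, _ in samples)
    weighted_cpi = sum(weight / ipc for weight, ipc in samples) / total_weight
    return 1.0 / weighted_cpi


def naive_average_ipc(samples):
    """Weighted arithmetic mean of IPC, shown only for comparison."""
    total_weight = sum(weight for weight, _ in samples)
    return sum(weight * ipc for weight, ipc in samples) / total_weight


# Half of the program runs at IPC 2.0 and half at IPC 0.5 (toy values).
toy_samples = [(0.5, 2.0), (0.5, 0.5)]
correct = combined_ipc(toy_samples)
naive = naive_average_ipc(toy_samples)
```

Here the naive average reports an IPC of 1.25, while the program as a whole actually achieves 0.8, because the slow half consumes four times as many cycles as the fast half.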

Knowing the clustering alone is not sufficient to enable 
multiple point simulation because the cluster centers do not 
correspond to actual intervals of execution. Instead, we must 
first pick a representative for each cluster that will be used to 
approximate the behavior of the full cluster. In order to
pick this representative, we choose for each cluster the actual 
interval that is closest to the center (centroid) of the cluster. 



In addition to this, we weigh any use of this representative by
the size of the cluster it is representing. If a cluster has only
one point, its representative will only have a small impact
on the overall outcome of the program. 
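
Picking one representative per cluster and weighting it by cluster size can be sketched as follows. The function name is hypothetical, the weight is assumed to be the fraction of all intervals that fall in the cluster, and the toy points stand in for the projected interval vectors.

```python
import math


def pick_simulation_points(points, membership, k):
    """Return (interval index, weight) for each non-empty cluster.

    The representative is the actual interval closest to the cluster's
    centroid; the weight is the cluster's share of all intervals.
    """
    dims = len(points[0])
    chosen = []
    for cluster in range(k):
        members = [i for i, m in enumerate(membership) if m == cluster]
        if not members:
            continue
        center = [sum(points[i][d] for i in members) / len(members)
                  for d in range(dims)]
        representative = min(members, key=lambda i: math.dist(points[i], center))
        chosen.append((representative, len(members) / len(points)))
    return chosen


# Toy data: three intervals in cluster 0 and a single interval in cluster 1.
toy_points = [[0, 0], [0, 2], [0, 1], [10, 10]]
toy_membership = [0, 0, 0, 1]
simulation_points = pick_simulation_points(toy_points, toy_membership, k=2)
```

The single-interval cluster still gets a representative, but its weight of 0.25 limits how much it can influence the combined result.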

Table 2 shows the multiple simulation points found for all 
of the SPEC 2000 benchmarks. For these results we lim- 
ited the number of clusters to be at most six for all but the 
most complex programs. This was done in order to limit the
number of simulation points, which also limits the amount of 
warmup time needed to perform the overall simulation. The 
cluster formation algorithm in Section 4 takes as an input 
parameter the max number of clusters to be allowed. Each 
simulation point contains two numbers. The first number is 
the location of the simulation point in 100s of millions of instructions.
The second number in parentheses is the weight
for that simulation point, which is used to create an overall 
combined metric. Each simulation point corresponds to 100 
million instructions. 

Figure 10 shows the IPC results for multiple simulation 
points. The first bar shows our single simulation points sim- 
ulating for 100 million instructions. The second bar LongSP 
chooses a single simulation point, but the length of simula- 
tion is identical to the length used for multiple simulation 
points (which may go up to 1 billion instructions). This is 
to provide a fair comparison between the single simulation 
points and multiple. The Multiple bar shows results using 
the multiple simulation points, and the final bar is IPC for 
full simulation. As in Figure 9, the closer the bar is to Full, 
the better. 

The results show that the average IPC error rate is re- 
duced to 3% using multiple simulation points, which is down 
from 17% using the long single simulation point. This is sig- 
nificantly lower than the average 80% error seen for blindly 
fast forwarding. The benefits can be most clearly seen in 
the programs bzip, gcc, ammp, and galgel. The reason that
the long contiguous simulation points do not do much better 
is that they are constrained to only sample at one place in 
the program. For many programs this is sufficient, but for 
those with interesting long term behavior, such as bzip, it is
impossible to approximate the full behavior.

Table 2: Single simulation points for SPEC 2000 benchmarks. Columns 2 through 4 are in terms of 100
million instructions executed. The length of full execution is shown, as well as the end of initialization. SP is
the single simulation point using the approach in this paper. The procedure in which the simulation point
occurred and its PC are also shown. The last 6 digits of the PC of each SimPoint are given in hex, so the address
is formed from 120xxxxxx. Procedure names that end in "." were truncated due to space. The rest of the
columns list the multiple simulation points found in 100s of millions. The first number is the starting place
of the simulation point relative to the start of execution. The second number shows the weight given to the
cluster that simulation point was taken from, and is used when weighing the final results of the simulation.

Figure 10: Multiple simulation point results. Simulation results are shown for using a single simulation point
simulating for 100 million instructions, LongSP chooses a single simulation point simulating for the same
length of execution as the multiple point simulation, simulation using multiple simulation points, and the full
execution of the program.

Figure 11 is the average over all of the floating point programs
(top graph) and integer programs (bottom graph). Errors
for IPC, branch miss rate, instruction and data cache
miss rates, and the unified L2 cache miss rate for the architecture
presented in Section 2 are shown. The errors are with
respect to these metrics for the full length of simulation using
SimpleScalar. Results are shown for starting simulation
at the start of the program (None), blindly fast forwarding a
billion instructions (FF-Billion), single simulation points of
duration 1 (SimPoint) and k (LongSP), and multiple simulation
points (Multiple).

The first thing to note is that using just a single small 
simulation point performs quite well on average across all of 
the metrics when compared to blindly fast-forwarding. Even 
though a single SimPoint does well, it is clearly beaten by 
using the clustering based scheme presented in this paper 
across all of the metrics examined. One thing that stands out 
on the graphs is that the error rates of the instruction cache 
and L2 cache appear to be high (especially for the integer 
programs) despite the fact that our technique is doing quite 
well in terms of overall performance. This is due to the fact 
that we present here an arithmetic mean of the errors, and 
there are several programs that have high error rates due to 
the very small number of cache misses. If there are 10 misses 
in the whole program, and we estimate there to be 100, that 
will result in an error of 10X. We point to the overall IPC 
as the most important metric for evaluation as it implicitly 
weighs each of the metrics by its relative importance. 
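
The effect described above, where a handful of programs with very few misses dominates an arithmetic mean of relative errors, can be illustrated with a short sketch. The miss counts are made-up illustrative values, not measurements from the paper, and `relative_error` is a hypothetical helper. Under the definition used here, an estimate that is 10 times the true count is a relative error of 9.0 (900%).

```python
def relative_error(estimated, actual):
    """Relative error |estimated - actual| / actual."""
    return abs(estimated - actual) / actual


# Hypothetical (actual, estimated) cache-miss counts for four programs.
# The last program has almost no misses, so a small absolute slip
# becomes a very large relative error.
samples = [
    (1_000_000, 1_050_000),
    (500_000, 480_000),
    (2_000_000, 2_100_000),
    (10, 100),
]

errors = [relative_error(est, act) for act, est in samples]

# Arithmetic mean over all programs versus the mean without the outlier.
mean_error = sum(errors) / len(errors)
mean_without_outlier = sum(errors[:-1]) / len(errors[:-1])

print(f"mean error with outlier:    {mean_error:.1%}")
print(f"mean error without outlier: {mean_without_outlier:.1%}")
```

The three well-behaved programs are each within 5%, yet the single 10-miss program pushes the arithmetic mean above 200%.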



6. RELATED WORK 

Time Varying Behavior of Programs: In [18], we provided 
a first attempt at showing the periodic patterns for all of 
the SPEC 95 programs, and how these vary over time for 
cache behavior, branch prediction, value prediction, address 
prediction, IPC and RUU occupancy. 



Training Inputs and Finding Smaller Representative In- 
puts: One approach for reducing the simulation time is to 
use the training or test inputs from the SPEC benchmark 
suite. For many of the benchmarks, these inputs are either 
(1) still too long to fully simulate, or (2) too short and place 
too much emphasis on the startup and shutdown parts of 
the program's execution, or (3) inaccurately estimate behav- 
ior such as cache accesses due to decreased working set size. 

KleinOsowski et al. [12] have developed a technique where 
they manually reduce the input sets of programs. The input 
sets were developed using a range of approaches from trun- 
cating of the input files to modification of source code to 
reduce the number of times frequent loops were traversed. 
For these input sets they develop, they make sure that they 
have similar results in terms of IPC, cache, and instruction 

Fast Forwarding and Check-pointing: Historically researchers 
have simulated from the start of the application, but this 
usually does not represent the majority of the program's be- 
havior because it is still in the initialization phase. Recently 
researchers have started to fast-forward to a given point in 
execution, and then start their simulation from there, ide- 
ally skipping over the initialization code to an area of code 
representative of the whole. During fast-forward the simula- 
tor simply needs to act as a functional simulator, and may 
take full advantage of optimizations like direct execution. Af- 
ter the fast-forward point has been reached, the simulator 
switches to full cycle level simulation. 

After fast-forwarding, the architecture state to be simu- 
lated is still cold, and a warmup time is needed in order to 
start collecting representative results. Efficiently warming 
up execution only requires references immediately preced- 
ing the start of simulation. Haskins and Skadron [7] exam- 
ined probabilistically determining the minimum set of fast- 
forward transactions that must be executed for warm up to 
accurately produce state as it would have appeared had the 
entire fast-forward interval been used for warm up [7]. They 
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recently examined using reuse analysis to determine how far 
before full simulation warmup needs to occur [8]. 
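
The interplay of fast-forwarding, warm-up, and measurement can be sketched with a toy model. This is only an illustration under stated assumptions: `DirectMappedCache`, `measured_miss_rate`, and the synthetic looping trace are hypothetical and stand in for a real cycle-level simulator such as SimpleScalar.

```python
class DirectMappedCache:
    """Tiny direct-mapped cache model that stores one address per line."""

    def __init__(self, num_lines):
        self.lines = [None] * num_lines

    def access(self, addr):
        """Return True on a hit; on a miss, install the address."""
        index = addr % len(self.lines)
        hit = self.lines[index] == addr
        self.lines[index] = addr
        return hit


def measured_miss_rate(trace, start, warmup, length, num_lines=64):
    """Miss rate over trace[start:start+length] after `warmup` warm-up references."""
    cache = DirectMappedCache(num_lines)
    # References before (start - warmup) are fast-forwarded:
    # they are skipped and no cache state is updated.
    for addr in trace[start - warmup:start]:
        cache.access(addr)  # warm-up: update state, record no statistics
    window = trace[start:start + length]
    misses = sum(1 for addr in window if not cache.access(addr))
    return misses / len(window)


# Synthetic trace: a loop that touches the same 32 addresses over and over.
trace = [i % 32 for i in range(10_000)]

cold = measured_miss_rate(trace, start=5_000, warmup=0, length=100)
warm = measured_miss_rate(trace, start=5_000, warmup=32, length=100)
print(f"cold-start miss rate: {cold:.2f}, warmed-up miss rate: {warm:.2f}")
```

In this toy trace the steady-state miss rate is zero, so every miss reported by the cold run is an artifact of starting with empty state, and a warm-up window covering only the references just before the measurement point removes it.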

An alternative to fast forwarding is to use check-pointing 
to start the simulation of a program at a specific point. With 
check-pointing, code is executed to a given point in the pro- 
gram and the state is saved, or checkpointed, so that other 
simulation runs can start there. In this way the initializa- 
tion section can be run just one time, and there is no need 
to fast forward past it each time. The architectural state 
(e.g., caches, register file, branch prediction, etc.) can either 
be stored in the trace (if they are not going to change across 
simulation runs) or can be warmed up in a manner similar 
to that described above. 
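
A minimal sketch of the check-pointing idea follows, assuming a toy `Machine` class whose entire state can be copied in memory. The class, the `INIT_LENGTH` constant, and `run_from_checkpoint` are hypothetical names used for illustration; a real simulator would serialize memory, registers, and any stored microarchitectural state to disk.

```python
import copy


class Machine:
    """Toy architectural state: an instruction counter and one accumulator."""

    def __init__(self):
        self.executed = 0
        self.acc = 0

    def run(self, num_instructions):
        for _ in range(num_instructions):
            self.acc = (self.acc * 31 + self.executed) % 1_000_003
            self.executed += 1


INIT_LENGTH = 50_000  # hypothetical length of the initialization section

# Execute the initialization section once and save the resulting state.
machine = Machine()
machine.run(INIT_LENGTH)
checkpoint = copy.deepcopy(machine)


def run_from_checkpoint(num_instructions):
    """Start a new run from the saved state without re-running initialization."""
    restored = copy.deepcopy(checkpoint)
    restored.run(num_instructions)
    return restored


run_a = run_from_checkpoint(1_000)
run_b = run_from_checkpoint(1_000)

# Reference: execute everything from the start of the program.
reference = Machine()
reference.run(INIT_LENGTH + 1_000)
print(run_a.acc == run_b.acc == reference.acc)
```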

Automatically Finding Where to Simulate: Our work is 
based upon the basic block distribution analysis in [19] as 
described in prior sections. Recent work on finding simula- 
tion points for data cache simulations is presented by Lafage 
and Seznec [13]. They proposed a technique to gather statis- 
tics over the complete execution of the program and use them 
to choose a representative slice of the program. They evalu- 
ate two metrics, one which captures memory spatial locality 
and one which captures memory temporal locality. They fur- 
ther propose to create specialized metrics such as instruction 
mix, control transfer, instruction characterization, and dis- 
tribution of data dependency distances to further quantify 
the behavior of both the program's full execution and 
the execution of samples. 

Statistical Sampling: Several different techniques have been 
proposed for sampling to estimate the behavior of the pro- 
gram as a whole. These techniques take a number of contigu- 
ous execution samples, referred to as clusters in [4], across the 
whole execution of the program. These clusters are spread 
out throughout the execution of the program in an attempt 
to provide a representative section of the application being 
simulated. Conte et al. [4] formed multiple simulation points 
by randomly picking intervals of execution, and then exam- 
ining how these fit to the overall execution of the program for 
several architecture metrics (IPC and branch and data cache 
statistics). Our work is complementary to this, where we 
provide a fast and metric independent approach for picking 
multiple simulation points based just on basic block vector 
similarity. When an architect gets a new binary to exam- 
ine they can use our approach to quickly find the simulation 
points, and then validate these with detailed simulation in 
parallel with using the binary. 

[Figure caption fragment: ... the SPEC 2000 floating point (top) and integer (bottom) benchmarks ... instruction, data and unified L2 cache miss rates.] 

Statistical Simulation: Another technique to improve sim- 
ulation time is to use statistical simulation [16]. Using sta- 
tistical simulation, the application is run once and a syn- 
thetic trace is generated that attempts to capture the whole 
program behavior. The trace captures such characteristics 
as basic block size, typical register dependencies and cache 
misses. This trace is then run for sometimes as little as 50- 
100,000 cycles on a much faster simulator. Nussbaum and 
Smith [15] also examined generating synthetic traces and us- 
ing these for simulation, which was proposed for fast design 
space exploration. We believe the techniques presented in 
this paper are complementary to the techniques of Oskin et 
al. and Nussbaum and Smith in that more accurate profiles 



can be determined using our techniques, and instead of at- 
tempting to characterize the program as a whole it can be 
characterized on a per-phase basis. 

7. SUMMARY 

At the heart of computer architecture and program opti- 
mization is the need for understanding program behavior. As 
we have shown, many programs have wildly different behav- 
ior on even the very largest of scales (over the full lifetime of 
the program). While these changes in behavior are drastic, 
they are not without order, even in very complex applica- 
tions such as gcc. In order to help future compiler and ar- 
chitecture researchers in exploiting this large scale behavior, 
we have developed a set of analytical tools that are capable 
of automatically and efficiently analyzing program behavior 
over large sections of execution. 

The development of the analysis is founded on a hardware 
independent metric, Basic Block Vectors, that can concisely 
summarize the behavior of an arbitrary section of execution 
in a program. We showed that by using Basic Block Vec- 
tors one can capture the behavior of programs as defined by 
several architectural metrics (such as IPC, and branch and 
cache miss rates). 
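
As a rough illustration of the Basic Block Vector idea, the sketch below builds a normalized vector for each interval of execution and compares intervals by Manhattan distance. It assumes the instruction-weighted, normalized form of the vector; the block names, block sizes, and the traces of executed blocks are hypothetical.

```python
from collections import Counter


def basic_block_vector(block_trace, block_sizes):
    """Normalized BBV: per-block entry counts weighted by block size, summing to 1."""
    counts = Counter(block_trace)
    weighted = {block: counts[block] * block_sizes[block] for block in counts}
    total = sum(weighted.values())
    return {block: weight / total for block, weight in weighted.items()}


def manhattan_distance(bbv_a, bbv_b):
    """Sum of absolute element differences; ranges from 0 (same) to 2 (disjoint)."""
    blocks = set(bbv_a) | set(bbv_b)
    return sum(abs(bbv_a.get(b, 0.0) - bbv_b.get(b, 0.0)) for b in blocks)


# Hypothetical static basic blocks and their instruction counts.
block_sizes = {"A": 4, "B": 6, "C": 10}

# Hypothetical sequences of basic blocks entered during three intervals.
interval_1 = ["A", "B", "A", "B", "A", "B"]
interval_2 = ["A", "B", "A", "B", "B", "A"]   # same code mix, different order
interval_3 = ["C", "C", "C", "A"]             # mostly different code

bbv_1 = basic_block_vector(interval_1, block_sizes)
bbv_2 = basic_block_vector(interval_2, block_sizes)
bbv_3 = basic_block_vector(interval_3, block_sizes)

distance_12 = manhattan_distance(bbv_1, bbv_2)
distance_13 = manhattan_distance(bbv_1, bbv_3)
print(distance_12, round(distance_13, 3))
```

Nothing in the vector depends on cache sizes, predictors, or any other hardware parameter, which is what makes the metric hardware independent.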

Using this framework, we examine the large scale behavior 
of several complex programs like gzip, bzip, and gcc, and 
find interesting patterns in their execution over time. The 
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behavior that we find shows that code and program behav- 
ior repeat over time. For example, in the input we exam- 
ined in detail for gcc we see that program behavior repeats 
itself every 23.6 billion instructions. Developing techniques 
that automatically capture behavior on this scale is useful for 
architectural, system level, and runtime optimizations. We 
present an algorithm based on the identification of clusters 
of basic block vectors that can find these repeating program 
behaviors and group them into sets for further analysis. For 
two of the programs gzip and gcc we show how the cluster- 
ing algorithm results line up nicely with the similarity matrix 
and correlate with the time varying IPC and data cache miss 
rates. 
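
Grouping intervals with similar basic block vectors can be sketched with a plain k-means loop. This is a simplified stand-in rather than the paper's exact procedure: the farthest-point initialization, the fixed `k`, and the eight hypothetical vectors are assumptions made to keep the example small and reproducible.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain k-means with a deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical BBVs for eight consecutive intervals that alternate
# between two kinds of behavior.
bbvs = [
    [0.70, 0.25, 0.05], [0.68, 0.27, 0.05],
    [0.10, 0.20, 0.70], [0.12, 0.18, 0.70],
    [0.71, 0.24, 0.05], [0.69, 0.26, 0.05],
    [0.11, 0.19, 0.70], [0.09, 0.21, 0.70],
]

labels, centroids = kmeans(bbvs, k=2)
print(labels)
```

Reading the labels in interval order gives a timeline of cluster membership, which is the kind of view that can be lined up against a similarity matrix or a time-varying IPC plot.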

It is increasingly common for computer architects and com- 
piler designers to use a small section of a benchmark to 
represent the whole program during the design and evalu- 
ation of a system. This leads to the problem of finding sec- 
tions of the program's execution that will accurately repre- 
sent the behavior of the full program. We show how our 
clustering analysis can be used to automatically find multi- 
ple simulation points to reduce simulation time and to accu- 
rately model full program behavior. We call this clustering 
tool to find single and multiple simulation points SimPoint. 
SimPoint along with additional simulation point data can 
be found at: http://www.cs.ucsd.edu/~calder/simpoint/. 
For the SPEC 2000 programs, we found that starting simula- 
tion at the start of the program results in an average error of 
210% when compared to the full simulation of the program, 
whereas blindly fast forwarding resulted in an average 80% 
IPC error. Using a single simulation point found using our 
basic block vector analysis resulted in an average 17% IPC 
error. When using the clustering algorithm to create multiple 
simulation points we saw an average IPC error of 3%. 
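
The way multiple simulation points are combined into one estimate, and how that estimate is scored against the full run, can be sketched as follows. The per-point IPC values, the cluster weights, and the full-run IPC are hypothetical numbers chosen for illustration; `weighted_estimate` and `percent_error` are hypothetical helper names.

```python
def weighted_estimate(points):
    """Combine per-simulation-point metric values using their cluster weights.

    `points` is a list of (metric_value, weight) pairs, where each weight is
    the fraction of execution covered by the cluster the point represents.
    """
    total_weight = sum(weight for _, weight in points)
    return sum(value * weight for value, weight in points) / total_weight


def percent_error(estimate, reference):
    """Error of an estimate relative to the full-simulation value, in percent."""
    return abs(estimate - reference) / reference * 100.0


full_run_ipc = 1.55  # hypothetical IPC from simulating the whole program

# Hypothetical (IPC, weight) pairs for three simulation points.
simulation_points = [(2.0, 0.5), (1.2, 0.3), (0.7, 0.2)]

multi_point_ipc = weighted_estimate(simulation_points)
single_point_ipc = simulation_points[0][0]  # using only the largest cluster

print(f"multiple points: {percent_error(multi_point_ipc, full_run_ipc):.1f}% error")
print(f"single point:    {percent_error(single_point_ipc, full_run_ipc):.1f}% error")
```

The same helper applies to any per-point metric, such as a cache miss rate; a per-instruction quantity like CPI could be substituted for IPC if exact cycle accounting is wanted.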

Automatically identifying the phase behavior using clus- 
tering is beneficial for architecture, compiler, and operating 
system optimizations. To this end, we have used the notion of 
basic block vectors and a random projection to create an ef- 
ficient technique for identifying phases on-the-fly [20], which 
can be efficiently implemented in hardware or software. Be- 
sides identifying phases, this approach can predict not only 
when a phase change is about to occur, but to which phase it 
is about to transition. We believe that using phase informa- 
tion can lead to new compiler optimizations with code tai- 
lored to different phases of execution, multi-threaded archi- 
tecture scheduling, power management, and other resource 
distribution problems controlled by software, hardware or the 
operating system. 
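
A loose sketch of on-the-fly phase identification with a random projection is given below. It is not the mechanism of [20]: the Gaussian projection matrix, the seed, the threshold of 0.1, the `PhaseTracker` class, and the example vectors are all assumptions, and the sketch only classifies intervals into phases without predicting upcoming phase changes.

```python
import random


def make_projection(num_blocks, num_dims, seed=0):
    """Random num_dims x num_blocks matrix with Gaussian entries (arbitrary seed)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(num_blocks)] for _ in range(num_dims)]


def project(matrix, bbv):
    """Map a basic block vector to a short signature via the projection matrix."""
    return [sum(r * x for r, x in zip(row, bbv)) for row in matrix]


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


class PhaseTracker:
    """Assigns a phase ID to each interval by matching its signature to stored ones."""

    def __init__(self, matrix, threshold):
        self.matrix = matrix
        self.threshold = threshold
        self.signatures = []  # one stored signature per phase ID

    def classify(self, bbv):
        signature = project(self.matrix, bbv)
        for phase_id, known in enumerate(self.signatures):
            if manhattan(signature, known) <= self.threshold:
                return phase_id          # close enough to a phase seen before
        self.signatures.append(signature)
        return len(self.signatures) - 1  # first time this behavior is seen


# Hypothetical normalized BBVs over six basic blocks.
phase_a = [0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
phase_a_noisy = [0.499, 0.501, 0.0, 0.0, 0.0, 0.0]
phase_b = [0.0, 0.0, 0.0, 0.0, 0.5, 0.5]

tracker = PhaseTracker(make_projection(num_blocks=6, num_dims=8), threshold=0.1)
phase_ids = [
    tracker.classify(bbv)
    for bbv in (phase_a, phase_a_noisy, phase_b, phase_a, phase_b)
]
print(phase_ids)
```

Each interval is compared only against a short list of stored signatures, so the work per interval stays small regardless of how long the program has run.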
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James A. Flight 



From: James A. Flight 
Sent: Saturday, July 07, 2007 1:45 PM 
To: 'calder@cs.ucsd.edu' 
Subject: Request for citation assistance 
Attachments: CS2002-0710.ps; CS2002-0710 (2).pdf 
Signed By: jflight@hfzlaw.com 

Sensitivity: Confidential 



Brad, 



We are trying to determine when the full text of your article "Phase tracking and Prediction" was available to the public. 
On its face, it appears to have been published in June of 2003. However, this link 

(http://www.cse.ucsd.edu/Dienst/UI/2.0/Describe/ncstrl.ucsd_cse/CS2002-0710) also available through here 
(http://www.cs.ucsd.edu/Dienst/UI/2.0/ListAuthors/S?authority=ncstrl.ucsd_cse) makes it appear that the article may have 
been available on request to you as early as June of 2002 (see the postscript file attached as a PDF hereto for your 
convenience). However, following the link in the Postscript file leads to this list 
(http://www-cse.ucsd.edu/users/calder/papers.html), which identifies the 2003 publication, and no 2002 publication. 

In view of the foregoing, can you help me understand what, if anything, was available in 2002? I suspect it was merely 
the abstract, but perhaps you had the draft ready and were making it available a year before its official 2003 publication? 

Thanks, in advance, for any assistance you are able to provide. 



Best regards, 

Jim 

HANLEY, FLIGHT & ZIMMERMAN 

150 South Wacker Drive, Suite 2100 
Chicago, Illinois 60606 

(312) 580-1034 (Direct) 
(312) 580-1020 (Main) 
(312) 580-9696 (Fax) 

jflight@hfzlaw.com 

Important: This electronic mail message and any attached files contain information intended for the exclusive use of the 
individual or entity to whom it is addressed and may contain information that is proprietary, privileged, confidential and/or 
exempt from disclosure under applicable law. If you are not the intended recipient, please notify the sender, by electronic 
mail or telephone, of any unintended recipients and delete the original message without making any copies. 



Techreport 



To obtain a copy of this techreport, please 
look for it at the following site: 

http://www-cse.ucsd.edu/users/calder/papers.html 

Or send email or a letter to: 
Brad Calder 

University of California, San Diego 

9500 Gilman Drive 

La Jolla, CA 92093-0114 

calder@cs.ucsd.edu 
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James A. Flight 



From: James A. Flight 
Sent: Monday, July 16, 2007 11:42 AM 
To: 'calder@cs.ucsd.edu' 
Subject: RE: Request for citation assistance 
Signed By: jflight@hfzlaw.com 

Sensitivity: Confidential 



Brad, 



I was wondering if you have had an opportunity to consider this issue. I would appreciate any assistance you are able to 
provide. 

Thank you 

Jim 



From: James A. Flight 

Sent: Saturday, July 07, 2007 1:45 PM 

To: 'calder@cs.ucsd.edu' 

Subject: Request for citation assistance 

Sensitivity: Confidential 

Brad, 

We are trying to determine when the full text of your article "Phase tracking and Prediction" was available to the public. 
On its face, it appears to have been published in June of 2003. However, this link 

(http://www.cse.ucsd.edu/Dienst/UI/2.0/Describe/ncstrl.ucsd_cse/CS2002-0710) also available through here 
(http://www.cs.ucsd.edu/Dienst/UI/2.0/ListAuthors/S?authority=ncstrl.ucsd_cse) makes it appear that the article may have 
been available on request to you as early as June of 2002 (see the postscript file attached as a PDF hereto for your 
convenience). However, following the link in the Postscript file leads to this list 
(http://www-cse.ucsd.edu/users/calder/papers.html), which identifies the 2003 publication, and no 2002 publication. 

In view of the foregoing, can you help me understand what, if anything, was available in 2002? I suspect it was merely 
the abstract, but perhaps you had the draft ready and were making it available a year before its official 2003 publication? 

Thanks, in advance, for any assistance you are able to provide. 

Best regards, 

Jim 

James A. Flight 



HANLEY, FLIGHT & ZIMMERMAN 

150 South Wacker Drive, Suite 2100 
Chicago, Illinois 60606 

(312) 580-1034 (Direct) 
(312) 580-1020 (Main) 
(312) 580-9696 (Fax) 

jflight@hfzlaw.com 



Important: This electronic mail message and any attached files contain information intended for the exclusive use of the 
individual or entity to whom it is addressed and may contain information that is proprietary, privileged, confidential and/or 
exempt from disclosure under applicable law. If you are not the intended recipient, please notify the sender, by electronic 
mail or telephone, of any unintended recipients and delete the original message without making any copies. 
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TO DECLARATION OF 
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James A. Flight 



From: postmaster@hfzlaw.com 

Sent: Monday, July 16, 2007 11:45 AM 

To: James A. Flight 

Subject: Delivery Status Notification (Relay) 

Attachments: ATT802303.txt; RE: Request for citation assistance 



This is an automatically generated Delivery Status Notification. 

Your message has been successfully relayed to the following recipients, but the requested 
delivery status notifications may not be generated by the destination. 

calder@cs.ucsd.edu 
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EXHIBIT 18 
TO DECLARATION OF 
JAMES A FLIGHT 



James A. Flight 



From: CSE Computing Support [webmaster@cs.ucsd.edu] 
Sent: Friday, July 06, 2007 7:48 PM 
To: James A. Flight 
Subject: [website #200731]: request for assistance. 



We had a couple of problems with the techreport server. I kicked it and it seems to be 
working now. That techreport is available either by following the link you gave, or by going 
to 

http://www.cs.ucsd.edu/facresearch/technicalreports/tech reports.html 
From there you can search by Author or by Year and also get to the report. 
Let us know if you have any other problems. 



On Fri, 6 Jul 2007 14:20:22 -0700, JFlight@hfzlaw.com wrote: 
> Hello, 
> 
> The publication Sherwood et al. Phase Tracking and Prediction, is 
> dated June 2003. However, according to the Internet, at least some 
> portion of that article was available on your cite 
> (www.cse.ucsd.edu/Dienst/UI/2.0/Describe/ncstrl.ucsd_cse/CS2002-0710) 
> in June of 2002. According to the Internet Wayback machine, this page 
> was on-line as of November 19, 2002. Can you assist my research by 
> telling me what exactly was available on your cite in the 2002 time 
> frame (e.g., by giving me a copy of the postscript file)? 
> 
> Thank you, 
> 
> Jim 
> 
> James A. Flight 
> 
> 150 South Wacker Drive, Suite 2100 
> Chicago, Illinois 60606 
> 
> (312) 580-1034 (Direct) 
> (312) 580-1020 (Main) 



-glenn 
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(312) 580-9696 (Fax) 
jflight@hfzlaw.com 



Important: This electronic mail message and any attached files contain 
information intended for the exclusive use of the individual or entity 
to whom it is addressed and may contain information that is 
proprietary, privileged, confidential and/or exempt from disclosure 
under applicable law. If you are not the intended recipient, please 
notify the sender, by electronic mail or telephone, of any unintended 
recipients and delete the original message without making any copies. 
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